oliverguhr / fullstop-deep-punctuation-prediction

A model that predicts the punctuation of English, Italian, French and German texts.
https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large
MIT License

Complete sentence prediction #20

Open orlink opened 4 days ago

orlink commented 4 days ago

Hi Oliver, your library is like the gift that keeps on giving. Thank you again for it.

I noticed that the model tends to predict a sentence-ending punctuation mark at the end of the input text even when the input is unlikely to be a complete sentence. Even the model trained for task 1 and English only behaves this way. It seems to be because the tokenizer tokenizes each TSV file separately, and there are thousands of .tsv files, including small ones, all of which end on a full stop or question mark. This appears to teach the model to almost always put a full stop at the end of the input text.

I tried grouping the training data from the .tsv files into larger chunks in the ModelTrainer.to_dataset() method before passing it to the tokenizer. I've added a parameter to the method called 'min_tokenize_size' and set it to 10,000, which seems to balance the predictions better, at least for task 1, English only. I plan to try it for task 2, hoping the current accuracy in other respects won't be lost. Please let me know if you have any suggestions.
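For illustration, the grouping boils down to something like this (a simplified sketch, not the exact code in to_dataset(); the helper name is made up):

```python
# Simplified sketch of the grouping idea: concatenate the (word, label) rows
# read from many small .tsv files into chunks of at least `min_tokenize_size`
# items before tokenizing, so that file endings become rare in the training data.
def group_into_chunks(files, min_tokenize_size=10_000):
    """files: list of per-file lists of (word, label) pairs."""
    chunks, current = [], []
    for rows in files:
        current.extend(rows)
        if len(current) >= min_tokenize_size:
            chunks.append(current)
            current = []
    if current:
        # Fold any remainder into the last chunk instead of keeping a tiny one.
        if chunks:
            chunks[-1].extend(current)
        else:
            chunks.append(current)
    return chunks
```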

oliverguhr commented 2 days ago

Wow, thanks for your effort!

I thought I had mitigated this issue by moving a "sliding text window" over the data. Most of the TSV files were too long for the model anyway, so I used a fixed-size token window and slid it over the training data. I hoped that the model could not overfit on file endings because they occur so rarely. However, I did not combine the training files, at least if I recall correctly. So there are some "text windows" that contain a file ending.
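In code, the idea was roughly the following (just a sketch of the concept, not the exact training code; the checkpoint name is only an example):

```python
# Sketch of the "sliding text window": a fixed-size window (max_length) is
# moved over the text with some overlap (stride). Only the very last window
# of a file contains the file ending.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")  # example checkpoint

text = "long text assembled from one training file ..."
encodings = tokenizer(
    text,
    max_length=512,                 # fixed window size in tokens
    stride=100,                     # overlap between consecutive windows
    truncation=True,
    return_overflowing_tokens=True,
)
# Each entry in encodings["input_ids"] is one window over the text.
```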

If you fix this issue, I am happy to merge it :)

orlink commented 2 days ago

So by sliding window you mean the max_length and stride tokenizer settings. Yeah, I guess it still overfits because there are several thousand TSV files per language, and the more languages and epochs, the easier it seems to overfit. What I noticed is that when I increased the training data to four languages, the 10,000 minimum tokenize size was no longer enough to prevent it, so now I'm trying four languages with a 100,000 minimum tokenize size.

I think there is an optimal value that depends on the number of languages (and thus TSV files) as well as the number of epochs, because a bit of this overfitting is perhaps still desirable for some applications, such as mine, so that the model puts a full stop at the end of short phrases like those used in medical records. I may end up making the parameter a function of the total number of files (items in the data array), roughly along the lines of the sketch below. Will keep you posted. Thanks again.
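(Hypothetical sketch only; the baseline numbers below are placeholders, not measured or tuned values.)

```python
# Hypothetical helper: scale the minimum chunk size linearly with the number
# of .tsv files, so the number of chunk endings per unit of training data
# stays roughly the same as in the English-only run that used 10,000.
# `baseline_files` is a placeholder for the file count of that baseline run.
def scaled_min_tokenize_size(num_files, baseline_size=10_000, baseline_files=1_000):
    return max(baseline_size, int(baseline_size * num_files / baseline_files))
```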