segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Incorrect splits #20

Closed · bminixhofer closed this 1 year ago

bminixhofer commented 3 years ago

Please report issues similar to https://github.com/bminixhofer/nnsplit/issues/18 here, i.e. text where it is easy for humans to see the correct split but NNSplit gets it wrong.

I'm not entirely satisfied with the quality of the models yet, and such cases might help improve them.

marlon-br commented 3 years ago

Hi, could you please take a look at the following split: "let me guess you're the kind of guy that ignores the rules cause it makes you feel in control am i right you're not wrong you think that's cute do you think it's cute"

The text is from a random TikTok video: https://www.tiktok.com/foryou?is_copy_url=1&is_from_webapp=v2#/@lisaandlena/video/6922836710988500229

It should be something like: "Let me guess, you're the kind of guy that ignores the rules cause it makes you feel in control. Am i right. You're not wrong. You think that's cute. Do you think it's cute"

but we have "Let me guess you're the kind of guy that ignores the rules cause it makes you feel in control am. I right you're not wrong. You think that's cute do you think it's cute" Why "am" and "I" are splitted is most questionable :)

BTW, have you considered adding texts from other sources to the model training? Wikipedia is mostly written or academic language, not everyday spoken language.

bminixhofer commented 3 years ago

Hi, thanks for reporting this!

As you said, the issue here is likely that the model is only trained on written (mostly academic) language, not on spoken language. For example, "am I right" probably hardly ever occurs at the start of a sentence in Wikipedia, so it makes sense that it isn't recognized.

> texts from other sources

Do you have any specific sources in mind? I considered using text from OPUS OpenSubtitles once, but the issue there is that the samples are often not really one sentence (from manual inspection). Texts from OPUS could probably be made to work with some preprocessing, though.
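To make "some preprocessing" concrete, here is a rough sketch of the kind of filter I have in mind, keeping only subtitle lines that look like a single complete sentence; the heuristics are assumptions, not a tested recipe:

```python
import re

def looks_like_one_sentence(line: str) -> bool:
    """Heuristic filter: keep subtitle lines that look like one full sentence."""
    line = line.strip()
    # must start with an uppercase letter (also drops "- Hi." dialogue dashes,
    # bracketed sound effects, music notes, ...)
    if not line or not line[0].isupper():
        return False
    # require exactly one sentence-terminal punctuation mark, at the very end
    if not re.fullmatch(r"[^.!?]*[.!?]", line):
        return False
    # very short fragments are usually not real sentences
    return len(line.split()) >= 3

samples = ["Where are you going?", "- Hi. - Hello.", "what", "He left. She stayed."]
print([s for s in samples if looks_like_one_sentence(s)])
# -> ['Where are you going?']
```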

I'm open to the idea of retraining the models on more diverse sources.

marlon-br commented 3 years ago

I think the best sources are social network comments, YouTube comments, and WhatsApp/Telegram chats, for example: https://www.kaggle.com/dolfik/russian-telegram-chats-history or https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/ etc.

But preprocessing would be required in any case.
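For example, a rough cleanup sketch along these lines (the filters are just assumptions about a generic chat export, not a tested pipeline):

```python
import re
from typing import Optional

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")

def clean_message(msg: str) -> Optional[str]:
    """Return a cleaned chat message, or None if it is unusable for training."""
    msg = URL.sub("", msg)
    msg = MENTION.sub("", msg)
    msg = re.sub(r"\s+", " ", msg).strip()
    # drop emoji/sticker-only or very short messages
    if len(msg.split()) < 3 or not any(c.isalpha() for c in msg):
        return None
    return msg

print(clean_message("lol https://t.co/xyz"))                     # None
print(clean_message("I think we should meet at five tomorrow"))  # kept as-is
```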

marlon-br commented 3 years ago


I recently discovered https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html from NVIDIA.

They use the following sources:

> The model was trained with Huggingface DistilBERT base uncased checkpoint on a subset of data from the following sources:
>
> - Tatoeba sentences
> - Books from Project Gutenberg that were used as part of the LibriSpeech corpus
> - Transcripts from Fisher English Training Speech

The output is exactly as expected:

"Let me guess, you're the kind of guy that ignores the rules cause it makes you feel in control. Am i right? You're not wrong? You think that's cute? Do you think it's cute?"

bminixhofer commented 3 years ago

Hi, thanks! This looks interesting as a starting point to distill further into the nnsplit models (since DistilBERT itself is probably still too slow).