segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Model(s) use word capitalisation to segment #101

Closed: intelliqua closed this issue 3 months ago

intelliqua commented 1 year ago

Hi,

The models we tested in English and a few other languages seem to rely on capitalisation to detect sentence boundaries. On our dataset, if the capitalisation at the start of target sentences is retained, the F1 score is as high as 0.90 for certain model+style+threshold combinations. If the sentence-initial words are lowercased, the best F1 score drops to 0.3.

Example: with 'wtp-bert-mini', the sentence 'We are running a test We should get two sentences' will split, but 'We are running a test we should get two sentences' won't.
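For reference, a minimal reproduction along these lines, assuming the WtP API from the README (the exact split results will depend on model version and threshold):

```python
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")

# Capitalised continuation: a boundary is detected.
print(wtp.split("We are running a test We should get two sentences"))

# Lowercased continuation: no boundary is detected.
print(wtp.split("We are running a test we should get two sentences"))
```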

I am not sure if this is expected behaviour or an issue.

Thanks

bminixhofer commented 1 year ago

Yes, case is not corrupted during pretraining, so the current models do rely on casing to some extent.

I may train models with case corruption, but I can't really give a timeline. If you need this, you're also very welcome to take a stab, here is the current corruption code: https://github.com/bminixhofer/wtpsplit/blob/e51232663f5a5169b077bb79c9210ba5df1e1a45/wtpsplit/utils.py#L120C1-L128.

There are some subtleties to be aware of, like:

intelliqua commented 1 year ago

Thanks for the detailed response. Yes, I saw the training script and realised that case isn't being corrupted during training. This only affects languages that use capitalisation at the start of sentences.

We are using SBD/punctuation restoration over the output generated by ASR (Whisper, MMS, Chirp). For raw evaluation of SBD, we take the reference transcripts, run them through Stanza, remove punctuation, and, if the language has capitalisation, lowercase the first word of each sentence (excluding exceptions like "I"). This is when the difference in scores was discovered: if the lowercasing step is skipped, the score is very high.
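A rough sketch of that preprocessing, assuming a Stanza tokenize pipeline (the helper name and exception list here are only illustrative):

```python
import string
import stanza

# stanza.download("en") is needed once before building the pipeline.
nlp = stanza.Pipeline(lang="en", processors="tokenize")

# Illustrative exception list; the real list would be language-specific.
KEEP_CASED = {"I", "I'm", "I'll", "I've", "I'd"}

def build_eval_input(reference_text: str) -> str:
    """Segment with Stanza, strip punctuation, lowercase sentence-initial words."""
    sentences = []
    for sent in nlp(reference_text).sentences:
        words = [w.text for w in sent.words if w.text not in string.punctuation]
        if not words:
            continue
        if words[0] not in KEEP_CASED:
            words[0] = words[0].lower()
        sentences.append(" ".join(words))
    # The SBD system under evaluation then has to recover these boundaries.
    return " ".join(sentences)
```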

We started with deepmultilingualpunctuation and punctuators, both of which are xlm-roberta based. Perhaps because of its training, the deepmultilingualpunctuation model is the least affected by differences in capitalisation.

To support more languages, we discussed training one of the above two models on open datasets.

It would be good to know the approximate compute resources and time needed to train the wtpsplit models.

Please feel free to close the issue, as this seems more like a feature request :)

Thanks.

bminixhofer commented 1 year ago

All good, we can use this issue to track the feature request ;)

Thanks for elaborating on your use case.

> It would be good to know the approximate compute resources and time needed to train the wtpsplit models.

Training from scratch takes 2-5 days on a TPUv3-8, depending on the model. But for case-insensitive training it would probably make sense to start from an existing model which should be much quicker (<1 TPUv3-8 day).

The main issue here is changing the corruption function to also corrupt casing, which is not that much work, but too much for me to do right away. Once that's implemented, I can start a run on one of my TPUs; a couple are idle right now.

intelliqua commented 1 year ago

If I understand the code and find the time, I can give it a try :) What do you propose for casing: turn everything that can be lowercased to lowercase, or retain all casing that isn't associated with true sentence boundaries (for languages with casing)?

bminixhofer commented 1 year ago

Sounds good! Yes, the second option - retain case as is in general, but for words following a paragraph break, stochastically swap the case some percentage of times.
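A minimal sketch of what that could look like (this is not the actual corruption code in wtpsplit/utils.py; the default probability and the way boundary positions are passed in are assumptions):

```python
import random

def corrupt_case(words, boundary_indices, swap_prob=0.5, seed=None):
    """Stochastically flip the case of words that follow a paragraph break.

    words: tokens of one training example.
    boundary_indices: positions of words that start a new paragraph/sentence.
    swap_prob: fraction of those words whose leading character is case-flipped.
    """
    rng = random.Random(seed)
    out = list(words)
    for i in boundary_indices:
        if out[i] and rng.random() < swap_prob:
            first = out[i][0]
            out[i] = (first.lower() if first.isupper() else first.upper()) + out[i][1:]
    return out
```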

intelliqua commented 1 year ago

Below is the rough flow for lower-casing the training input (not including the subtleties you had already pointed out). My concerns are:

1) It will require some library (like spaCy or Stanza) to do splitting and POS tagging for certain languages. Also, this processing will need to be done on text rather than on char tokens?

2) This assumes that the inference input will be correctly cased for the language (nouns/proper nouns and other rules). This is true for models like Whisper or Chirp, which, depending on the weather, also sometimes produce proper punctuation. Won't this model become specific to the MT output of such models?

I don't know how the random swapping of the case is to be done. Is it done regardless of sentence boundaries? Or do we retain correct casing for some percentage of the sentence boundaries?

[attached flowchart: flow-lower]

bminixhofer commented 1 year ago

Those are good questions.

Regarding (1): In principle POS tagging and so on makes sense, but this would be too slow at the scale that's needed, and probably would not improve results that much, so I would not do it.

Regarding (2): Also true. There are multiple ways to do the swapping, depending on whether all-lowercase text is expected or just text which has casing but is potentially not correctly cased. I think the following makes the most sense:

Edit: maybe a model trained on all-lowercased text is sufficient. But my feeling is that for text which does have this partial casing, you can do quite a bit better with a model which does not completely discard case during training.

intelliqua commented 1 year ago

I agree that running a (word) splitter and POS tagging at this scale would be resource intensive.

Also, it makes sense to train a model on lower-cased text, as this is a more generic solution covering more cases and it is much simpler to implement.

It is difficult for me to project how much difference there will be between the all-lowercase and partial-casing models. This data is all lower case: https://sites.google.com/view/sentence-segmentation/#h.cdqb7onnibyi

The model trained on this data, https://github.com/oliverguhr/deepmultilingualpunctuation (FullStop: Multilingual Deep Models for Punctuation Prediction), works reasonably well on the MT input. Its scores are stable whether the input is properly cased or all casing is removed.
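For reference, the kind of check this comparison is based on, assuming the PunctuationModel API shown in that repository's README:

```python
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()

cased = "We are running a test We should get two sentences"
lowered = cased.lower()

# If the model is robust to casing, the restored punctuation (and hence the
# implied sentence boundaries) should be roughly the same in both runs.
print(model.restore_punctuation(cased))
print(model.restore_punctuation(lowered))
```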

On the other hand, https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english performs worse with lower-cased sentence-start inputs.

I think it will be good to first see what happens with models trained on lower-cased text only.

mgoldenbe commented 1 year ago

Has there been progress on this? I work with auto-generated transcripts on YouTube. There is no capitalization there.

bminixhofer commented 1 year ago

Unfortunately, at the moment my compute resources are tied up in another project. I'll let you know when I've trained the model. Depending on how strongly you need this, you may want to look into training a model yourself: https://github.com/bminixhofer/wtpsplit#reproducing-the-paper. Happy to assist if there are any issues.

markus583 commented 3 months ago

Hi, we recently introduced SaT, which strongly improves upon WtP. We specifically focused on such cases by adding corruptions to the text (including casing) that resemble those discussed here. Overall, our new SaT models (especially the -sm variants) should handle irregular casing & punctuation much better now. Feel free to give it a try, hope it is useful!
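For example, something along the lines of the current README (the model name is one of the -sm variants; the exact output depends on the model):

```python
from wtpsplit import SaT

sat = SaT("sat-3l-sm")

# Lowercased, unpunctuated ASR-style input.
print(sat.split("we are running a test we should get two sentences"))
```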