jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Improve sentence splitter for tokenised text #45

Open jowagner opened 3 years ago

jowagner commented 3 years ago

The heuristic in split_tokenised_text_into_sentences.py is too simplistic:

Suggestion:

jowagner commented 3 years ago

Further issues observed:

jbrry commented 3 years ago

For evaluating custom tokenisation/segmentation with UDPipe v2.7 en_ewt+ga_idt, would it be worth training a VariKN model in opusfilter and looking at the language model's token perplexities between NCI with custom segmentation and NCI with UDPipe segmentation?

Or is the idea that we want to do all of the segmentation/tokenisation ourselves and just use UDPipe to see if anything unusual stands out with our custom scripts?

jowagner commented 3 years ago

If you think this would be useful to measure and understand the differences, sure. I don't see, though, how LM perplexity can help you here. I found it useful to look at a side-by-side diff of the two outputs. I compared the output before and after commit e387cbe3564 this way to check that I didn't accidentally make things worse. Commands:

bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences_old.py --verbose > test5v.txt
bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences.py --verbose > test6v.txt
diff -U 26 test[56]v.txt > t5to6v.patch
kompare t5to6v.patch

jowagner commented 3 years ago

Commit 87667df93c09dfa64da9a20ebd3abe21cb03b793 improves handling of quotes, brackets, ellipses and '# 5 . Abc. The script is starting to get slow and harder to maintain.

jowagner commented 3 years ago

Further ideas based on a sample of OSCAR data that was tokenised with udpipe:

jowagner commented 3 years ago

Not splitting after a closing bracket fixes many issues but also introduces a new, frequent one: In plays, actor instructions are often in brackets and then, if we do not split after the closing bracket, the instructions will not be properly separated from the next line. The next line always starts with the name of the character in all caps followed by a colon. So this could be fixed by detecting that locally a lot of sentences start with this token sequence.
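
A rough sketch of how that local detection could look, not what split_tokenised_text_into_sentences.py currently does. It assumes the input is already a list of token lists. The function names, the window size, the 0.3 threshold and the example dialogue are all made up for illustration and would need tuning on real NCI/OSCAR data.

```python
CLOSING_BRACKETS = {")", "]", "}"}


def starts_with_speaker_label(tokens, max_name_tokens=3):
    """Return True if tokens start with 1..max_name_tokens all-caps
    tokens followed by a colon, e.g. ['MÁIRE', ':', ...]."""
    for i, tok in enumerate(tokens[:max_name_tokens + 1]):
        if tok == ":":
            return i > 0          # a colon alone is not a label
        if not tok.isupper():
            return False
    return False                  # no colon within the allowed span


def is_play_region(lines, index, window=10, min_fraction=0.3):
    """Return True if enough of the lines around `index` start with a
    speaker label (hypothetical threshold, not tuned on real data)."""
    lo = max(0, index - window)
    hi = min(len(lines), index + window + 1)
    neighbourhood = lines[lo:hi]
    hits = sum(starts_with_speaker_label(line) for line in neighbourhood)
    return hits / len(neighbourhood) >= min_fraction


def split_after_bracket_in_plays(lines, window=10, min_fraction=0.3):
    """Inside detected play regions only, split a line after a closing
    bracket when the tokens that follow form a speaker label."""
    result = []
    for index, tokens in enumerate(lines):
        if not is_play_region(lines, index, window, min_fraction):
            result.append(tokens)
            continue
        start = 0
        for i in range(1, len(tokens)):
            if (tokens[i - 1] in CLOSING_BRACKETS
                    and starts_with_speaker_label(tokens[i:])):
                result.append(tokens[start:i])
                start = i
        result.append(tokens[start:])
    return result


if __name__ == "__main__":
    # Invented example dialogue, tokenised on whitespace.
    play = [
        "SEÁN : Dia duit . ( Imíonn sé ) MÁIRE : Slán .".split(),
        "SEÁN : Slán leat .".split(),
    ]
    for sentence in split_after_bracket_in_plays(play):
        print(" ".join(sentence))
```

Restricting the extra split to regions with a high density of speaker labels keeps the general "do not split after a closing bracket" rule intact for ordinary prose, where an all-caps token plus colon after a bracket is more likely an abbreviation than a character name.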