jowagner opened this issue 3 years ago
Further issues observed:
, Vol . 5 , No . 1 1984 lch. 26
:heavy_check_mark: bhoth úd . ' ' Na madaí
:heavy_check_mark: teicniúil le tríú 1 . Gan dochar do
:heavy_check_mark: 1 . An Fhrainc . 2 . An Iorua . 3 . An Aetóip .
:heavy_check_mark: 3. Roald Dahl . 4 . (a) Sherlock Holmes .
:heavy_check_mark: in Airteagal K. 16 . 4 . Beidh feidhm
(4th item) :heavy_check_mark: siad sin air . 1791 86 . # 5 . A dh' amharc air . .
:heavy_check_mark: For evaluating custom tokenisation/segmentation with UDPipe v2.7
en_ewt+ga_idt
, would it be worth training a VariKN
model in opusfilter
and looking at the language model's token perplexities between NCI
with custom segmentation and NCI
with UDPipe segmentation?
Or is the idea that we want to do all of the segmentation/tokenisation ourselves and just use UDPipe to see if anything unusual stands out with our custom scripts?
If you think this would be useful for measuring and understanding the differences, sure. I don't see, though, how LM perplexity can help you here. I found it useful to look at a side-by-side diff of the two outputs. I compared output before and after commit e387cbe3564 this way to check that I didn't accidentally make things worse. Commands:
```shell
bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences_old.py --verbose > test5v.txt
bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences.py --verbose > test6v.txt
diff -U 26 test[56]v.txt > t5to6v.patch
kompare t5to6v.patch
```
Commit 87667df93c09dfa64da9a20ebd3abe21cb03b793 improves handling of quotes, brackets, ellipses and `# 5 . Abc` cases. The script is starting to get slow and more difficult to maintain.
Further ideas based on a sample of OSCAR data that was tokenised with UDPipe:
Not splitting after a closing bracket fixes many issues but also introduces a new, frequent one: in plays, stage directions are often in brackets and, if we do not split after the closing bracket, the directions are not properly separated from the next line. The next line always starts with the character's name in all caps followed by a colon, so this could be fixed by detecting that, locally, many sentences start with this token sequence.
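A minimal sketch of that local fix (the helper names, the speaker pattern and the 0.2 threshold are my assumptions, not what the script currently does):

```python
def looks_like_speaker_turn(tokens):
    """True if a token sequence starts like a play line:
    an all-caps character name followed by a colon, e.g. "SEÁN : ..."."""
    return len(tokens) >= 2 and tokens[0].isupper() and tokens[1] == ":"

def should_split_after_bracket(next_tokens, speaker_ratio):
    """Split after a closing bracket only when the surrounding text
    looks like a play, i.e. a large share of candidate sentences open
    with SPEAKER + ":". The 0.2 threshold is a guess and needs tuning."""
    return speaker_ratio > 0.2 and looks_like_speaker_turn(next_tokens)
```

The `speaker_ratio` would be estimated over a local window of the document, so ordinary prose with an occasional bracketed aside is unaffected.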
The heuristic in `split_tokenised_text_into_sentences.py` is too simplistic:

- `' Is cuid den searmanas é . ' ar sise .` should not count as a split point.
- `1 .` — numbers like this seem to be tokenised as two tokens in the NCI. These should not be split.
- `IV` or `iv` at the start of a sentence.

Suggestion:

- `(a)`, `)`
- `DR .` (`Dr.` is tokenised correctly.) :heavy_check_mark:
- `Prof .` (`Prof.` does not occur.) :heavy_check_mark:
- `nDr .` (seems to be an inflected form of `Dr.`; always following `an`) :heavy_check_mark:
- `Iml .`
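These cases could be collected into one guard. A minimal sketch, assuming space-separated tokens and a hand-maintained abbreviation list; the function name and the exact lists are illustrative, not what the script contains:

```python
import re

# Abbreviations after which a following " . " token is not a sentence end.
ABBREVIATIONS = {"Dr", "DR", "nDr", "Prof", "Iml"}

# Roman numerals in either case, e.g. "IV" or "iv". Note this also
# matches a bare "I", which may be too aggressive for mixed-language text.
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)

def is_sentence_final_period(prev_token):
    """Decide whether a standalone '.' token after prev_token ends a sentence."""
    if prev_token in ABBREVIATIONS:
        return False   # e.g. "Dr ." or "Iml ."
    if prev_token.isdigit():
        return False   # enumeration numbers tokenised as "1 ."
    if ROMAN_RE.match(prev_token):
        return False   # list numbering like "IV ." or "iv ."
    return True
```

Keeping the exceptions in data (a set and a regex) rather than in branching code might also help with the maintainability concern mentioned above.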