jowagner opened this issue 3 years ago
Further issues observed:
, Vol . 5 , No . 1 1984 lch. 26
:heavy_check_mark: bhoth úd . ' ' Na madaí
:heavy_check_mark: teicniúil le tríú 1 . Gan dochar do
:heavy_check_mark: 1 . An Fhrainc . 2 . An Iorua . 3 . An Aetóip .
:heavy_check_mark: 3. Roald Dahl . 4 . (a) Sherlock Holmes .
:heavy_check_mark: in Airteagal K. 16 . 4 . Beidh feidhm
(4th item) :heavy_check_mark: siad sin air . 1791 86 . # 5 . A dh' amharc air . .
:heavy_check_mark: For evaluating custom tokenisation/segmentation with UDPipe v2.7
en_ewt+ga_idt
, would it be worth training a VariKN
model in opusfilter
and looking at the language model's token perplexities between NCI
with custom segmentation and NCI
with UDPipe segmentation?
Or is the idea that we want to do all of the segmentation/tokenisation ourselves and just use UDPipe to see if anything unusual stands out with our custom scripts?
If you think this would be useful for measuring and understanding the differences, sure. I don't see, though, how LM perplexity can help you here. I found it useful to look at a side-by-side diff of the two outputs. I compared output before and after commit e387cbe3564 this way to check that I didn't accidentally make things worse. Commands:
```shell
bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences_old.py --verbose > test5v.txt
bzcat data/ga/sampleNCI/raw/sample-25000.txt.bz2 | ./scripts/split_tokenised_text_into_sentences.py --verbose > test6v.txt
diff -U 26 test[56]v.txt > t5to6v.patch
kompare t5to6v.patch
```
Commit 87667df93c09dfa64da9a20ebd3abe21cb03b793 improves handling of quotes, brackets, ellipses and `# 5 . Abc` cases. The script is starting to get slow and more difficult to maintain.
Further ideas based on a sample of OSCAR data that was tokenised with UDPipe:
Not splitting after a closing bracket fixes many issues but also introduces a new, frequent one: in plays, stage directions are often in brackets and, if we do not split after the closing bracket, the directions are not properly separated from the next line. The next line always starts with the character's name in all caps followed by a colon, so this could be fixed by detecting that, locally, many sentences start with this token sequence.
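A minimal sketch of that local fix (the helper names, the speaker pattern and the 0.2 threshold are my assumptions, not what the script currently does):

```python
def looks_like_speaker_turn(tokens):
    """True if a token sequence starts like a play line:
    an all-caps character name followed by a colon, e.g. "SEÁN : ..."."""
    return len(tokens) >= 2 and tokens[0].isupper() and tokens[1] == ":"

def should_split_after_bracket(next_tokens, speaker_ratio):
    """Split after a closing bracket only when the surrounding text
    looks like a play, i.e. a large share of candidate sentences open
    with SPEAKER + ":". The 0.2 threshold is a guess and needs tuning."""
    return speaker_ratio > 0.2 and looks_like_speaker_turn(next_tokens)
```

The `speaker_ratio` would be estimated over a local window of the document, so ordinary prose with an occasional bracketed aside is unaffected.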
The heuristic in `split_tokenised_text_into_sentences.py` is too simplistic:

- `' Is cuid den searmanas é . ' ar sise .` should not count as a split point.
- `1 .` — numbers like this seem to be tokenised as two tokens in the NCI. These should not be split.
- `IV` or `iv` at the start of a sentence.

Suggestion:

- `(a)`, `)`
- `DR .` (`Dr.` is tokenised correctly.) :heavy_check_mark:
- `Prof .` (`Prof.` does not occur.) :heavy_check_mark:
- `nDr .` (seems to be an inflected form of `Dr.`; always following `an`) :heavy_check_mark:
- `Iml .`
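These cases could be collected into one guard. A minimal sketch, assuming space-separated tokens and a hand-maintained abbreviation list; the function name and the exact lists are illustrative, not what the script contains:

```python
import re

# Abbreviations after which a following " . " token is not a sentence end.
ABBREVIATIONS = {"Dr", "DR", "nDr", "Prof", "Iml"}

# Roman numerals in either case, e.g. "IV" or "iv". Note this also
# matches a bare "I", which may be too aggressive for mixed-language text.
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)

def is_sentence_final_period(prev_token):
    """Decide whether a standalone '.' token after prev_token ends a sentence."""
    if prev_token in ABBREVIATIONS:
        return False   # e.g. "Dr ." or "Iml ."
    if prev_token.isdigit():
        return False   # enumeration numbers tokenised as "1 ."
    if ROMAN_RE.match(prev_token):
        return False   # list numbering like "IV ." or "iv ."
    return True
```

Keeping the exceptions in data (a set and a regex) rather than in branching code might also help with the maintainability concern mentioned above.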