jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Apply udpipe sentence splitter on appropriate units of text #66

Closed jbrry closed 3 years ago

jbrry commented 3 years ago

The segmenter in wiki-bert-pipeline operates on a line-by-line basis, which means that if the input is a paragraph of text spanning multiple lines, the tokenizer will split at the end of each line no matter what. For example, given this input paragraph (x.txt):

Longfoirt Ba é Proinsias rogha an bhoird agallaimh. Tá breis agus 30 % den tír
faoi fhoraoisí. I lár an 19ú haois rinne John Windele, dinnseanchaí, tagairt do
na deilbhíní mar "Chailleacha an Chaisleáin" Bhain an tír amach an
neamhspleáchas sa bhliain 1991, agus is é Nursultan Nazarbayev atá ina Uachtarán
ón am sin anuas. Bhí frustrachas ar Pol Pot agus ar na ceannairí sinsearacha
eile ná go raibh muintir na gcathrach chomh tugtha dá sean-nósanna cosúil leis
an trádáil agus leis an ngnó.

Instead of the output UDPipe itself produces when run directly on the whole file:

jbarry@g001:~/spinning-storage/jbarry/ga_BERT/wiki-bert-pipeline/data/paracrawl_filtering_None/ga$ udpipe --tokenize --output horizontal ../../../../Irish-BERT/udpipe_models/ga_en_combined.udpipe x.txt 
Loading UDPipe model: done.
Longfoirt
Ba é Proinsias rogha an bhoird agallaimh .
Tá breis agus 30 % den tír faoi fhoraoisí .
I lár an 19ú haois rinne John Windele , dinnseanchaí , tagairt do na deilbhíní mar " Chailleacha an Chaisleáin " Bhain an tír amach an neamhspleáchas sa bhliain 1991 , agus is é Nursultan Nazarbayev atá ina Uachtarán ón am sin anuas .
Bhí frustrachas ar Pol Pot agus ar na ceannairí sinsearacha eile ná go raibh muintir na gcathrach chomh tugtha dá sean-nósanna cosúil leis an trádáil agus leis an ngnó .

the pipeline's tokenizer, scripts/udtokenize.py, creates a new line wherever there is a newline in the input paragraph:

python ../../../scripts/udtokenize.py ../../../../Irish-BERT/udpipe_models/ga_en_combined.udpipe x.txt > x-tokenized.txt

# x-tokenized.txt

Longfoirt
Ba é Proinsias rogha an bhoird agallaimh .
Tá breis agus 30 % den tír
faoi fhoraoisí .
I lár an 19ú haois rinne John Windele , dinnseanchaí , tagairt do
na deilbhíní mar " Chailleacha an Chaisleáin " Bhain an tír amach an
neamhspleáchas sa bhliain 1991 , agus is é Nursultan Nazarbayev atá ina Uachtarán
ón am sin anuas .
Bhí frustrachas ar Pol Pot agus ar na ceannairí sinsearacha
eile ná go raibh muintir na gcathrach chomh tugtha dá sean-nósanna cosúil leis
an trádáil agus leis an ngnó .

This means that either the conll17 datasets should be tokenized on their own and then skipped by this script in the wiki-bert-pipeline, or udtokenize.py should be altered to handle files whose sentences span multiple lines.
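To illustrate the second option, here is a minimal sketch (not the actual udtokenize.py code; the model path and input file are placeholders) that joins blank-line-separated paragraphs back into single strings before handing them to UDPipe through the ufal.udpipe bindings, so that sentence boundaries are decided by the model rather than by hard line breaks:

# Sketch only: join hard-wrapped lines into whole paragraphs, then let UDPipe
# decide sentence boundaries. Blank lines are assumed to separate paragraphs.
from ufal.udpipe import Model, Pipeline, ProcessingError

def read_paragraphs(path):
    """Yield each blank-line-separated paragraph as one string."""
    lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                lines.append(line)
            elif lines:
                yield " ".join(lines)
                lines = []
    if lines:
        yield " ".join(lines)

model = Model.load("ga_en_combined.udpipe")   # placeholder model path
# mirrors `udpipe --tokenize --output horizontal`
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "horizontal")
error = ProcessingError()

for paragraph in read_paragraphs("x.txt"):    # placeholder input file
    print(pipeline.process(paragraph, error), end="")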

jowagner commented 3 years ago

Could we easily change the conllu text extractor to a tool that simply writes one tokenised sentence per line and then tell wikibert to not sentence-split and not tokenise the conllu corpus at all? Having conllu-style tokens in the training data probably is also good for conllu parsing.

Edit: removed English tokenisation example not relevant here

jbrry commented 3 years ago

Yes, previously tokenization was done immediately after converting from CoNLL-U to raw text paragraphs: https://github.com/jbrry/Irish-BERT/blob/master/scripts/download_scripts/download_conll17_data.sh#L51

This was stopped because the input line was being tokenized twice, so we deferred tokenization until scripts/udtokenize.py, which appears to be where the problem occurs. We could revert to the old behaviour and then just add a utility in udtokenize.py to skip a file if its name begins with "conll17".
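A sketch of that skip check (illustrative names only, not the actual udtokenize.py interface):

# Sketch: in udtokenize.py, skip inputs that already arrive tokenized upstream.
import os

def should_skip(input_path):
    # conll17 files are assumed to be pre-tokenized, one sentence per line
    return os.path.basename(input_path).startswith("conll17")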

The main problem is that our BERT and ELECTRA models have already been trained, and I'm not sure how feasible re-training them is (without TPUs and funding for VMs and storage). Our main saving grace is that conll17 consists of Wikipedia articles and CommonCrawl data, which we have anyway in some of our other corpora (Wikipedia and OSCAR). Also, the BERT README mentions:

However, you may want to intentionally add a slight amount of noise to your input data (e.g., randomly truncate 2% of input segments) to make it more robust to non-sentential input during fine-tuning.

So these mid-sentence splits might at least contribute towards exposing the model to non-sentential input.

jowagner commented 3 years ago

Your last question brings us back to issue #58.

Let's keep this issue open. It's too important to close it with a wont fix label.

I wouldn't run scripts/conllu_to_text.pl at all and instead use something like https://github.com/jowagner/mtb-tri-training/blob/master/scripts/get-conllu-text.py that preserves the tokenisation and sentence boundaries of the .conllu file, skipping any %d-%d and %d.%d tokens.
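For reference, a minimal sketch of that kind of extractor (a rough approximation, not the linked get-conllu-text.py itself): it writes one space-joined token line per sentence and skips multiword-token ranges and empty-node IDs.

# Sketch: emit one tokenised sentence per line from a .conllu file,
# skipping multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "5.1").
import sys

def conllu_sentences(lines):
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                  # blank line marks end of sentence
            if tokens:
                yield tokens
                tokens = []
            continue
        if line.startswith("#") or "\t" not in line:
            continue                  # skip comments and malformed lines
        token_id, form = line.split("\t")[:2]
        if "-" in token_id or "." in token_id:
            continue                  # skip ranges and empty nodes
        tokens.append(form)
    if tokens:
        yield tokens

if __name__ == "__main__":
    for sentence in conllu_sentences(sys.stdin):
        print(" ".join(sentence))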

jbrry commented 3 years ago

conll17 is now tokenized by the above script as of b9a0143c72a35d4be7a45f8213c655a3d57078f1 and is copied directly to tokenized-texts in the wiki-bert-pipeline (i.e. scripts/udtokenize.py will no longer process these files). Closing.