Closed jbrry closed 3 years ago
Could we easily change the conllu text extractor to a tool that simply writes one tokenised sentence per line, and then tell wikibert not to sentence-split or tokenise the conllu corpus at all? Having conllu-style tokens in the training data is probably also good for conllu parsing.
Edit: removed English tokenisation example not relevant here
Yes, previously tokenization was done immediately after converting from CoNLL-U to raw text paragraphs: https://github.com/jbrry/Irish-BERT/blob/master/scripts/download_scripts/download_conll17_data.sh#L51
This was stopped because the input line was being tokenized twice, so we deferred tokenization to scripts/udtokenize.py, which appears to be where the problem occurs. We could revert to the old behaviour and add a utility in udtokenize.py to skip a file if its name begins with "conll17".
The main problem is that our BERT and ELECTRA models have already been trained, and I'm not sure how feasible re-training them is (without TPUs and funding for VMs and storage). Our main saving grace is that conll17 is made up of Wikipedia articles and CommonCrawl text, which we have anyway in some of the other corpora (Wikipedia and OSCAR). Also, the BERT README mentions:
However, you may want to intentionally add a slight amount of noise to your input data (e.g., randomly truncate 2% of input segments) to make it more robust to non-sentential input during fine-tuning.
So perhaps these mid-sentence splits might at least contribute towards exposing the model to non-sentential input.
Your last question brings us back to issue #58.
Let's keep this issue open. It's too important to close with a wontfix label.
I wouldn't run scripts/conllu_to_text.pl at all and would instead use something like https://github.com/jowagner/mtb-tri-training/blob/master/scripts/get-conllu-text.py, which preserves the tokenisation and sentence boundaries of the .conllu file, skipping any %d-%d and %d.%d tokens.
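A sketch of that kind of extractor, in the spirit of get-conllu-text.py (the function name is mine, and the column handling is the standard CoNLL-U layout rather than anything specific to that script): keep the gold tokenization and sentence boundaries, and drop multiword-token ranges (IDs like `1-2`) and empty nodes (IDs like `5.1`).

```python
# Extract one tokenized sentence per line from CoNLL-U text,
# preserving sentence boundaries and skipping range/empty-node rows.

def conllu_to_tokenized_lines(conllu_text: str):
    """Yield one space-joined tokenized sentence per CoNLL-U sentence."""
    tokens = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # Blank line ends a sentence.
            if tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        if line.startswith("#"):  # sentence-level comment lines
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("5.1").
        if "-" in token_id or "." in token_id:
            continue
        tokens.append(cols[1])  # FORM column
    if tokens:
        yield " ".join(tokens)
```

Writing each yielded sentence on its own line gives exactly the one-tokenised-sentence-per-line format proposed above.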
conll17 is now tokenized by the above script as of b9a0143c72a35d4be7a45f8213c655a3d57078f1 and is copied directly to tokenized-texts in the wiki-bert-pipeline (i.e. scripts/udtokenize.py will no longer process these files). Closing.
The segmenter in the wiki-bert-pipeline operates on a line-by-line basis, which means that if the input is a paragraph of text spanning multiple lines, the tokenizer will split at the end of each line no matter what: instead of keeping a sentence intact, it starts a new output line wherever there is a newline in the input paragraph.
This means the conll17 datasets should be tokenized on their own and then skipped by this script in the wiki-bert-pipeline, or udtokenize.py should be altered to handle files in which sentences span multiple lines.
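The second option could be as simple as pre-joining each blank-line-separated paragraph into one line before the line-based segmenter sees it, so a sentence wrapped across lines is no longer cut at the line break. This is a hypothetical sketch, not the actual udtokenize.py code:

```python
# Merge consecutive non-blank lines into single paragraph strings,
# using blank lines as paragraph boundaries. The joined paragraphs can
# then be handed to the existing line-based sentence segmenter.

def join_paragraphs(lines):
    """Yield one string per blank-line-separated paragraph."""
    paragraph = []
    for line in lines:
        if line.strip():
            paragraph.append(line.strip())
        elif paragraph:
            yield " ".join(paragraph)
            paragraph = []
    if paragraph:
        yield " ".join(paragraph)
```

With this in front of the segmenter, hard line wraps inside a paragraph no longer force spurious sentence splits.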