jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

How is the NCI sampled? #7

Open fosterjen opened 4 years ago

fosterjen commented 4 years ago

Do the sentences follow on from each other? How big are the passages that are sampled? Do we have document/passage delimiter information?

This could affect the Next Sentence Prediction task in BERT.
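For context (this is not from the NCI documentation): the standard BERT pretraining data preparation, e.g. Google's create_pretraining_data.py, expects one sentence per line with a blank line between documents, so missing document/passage boundaries would directly affect how next-sentence pairs are sampled. A minimal illustration of the expected input format:

```text
First sentence of document A.
Second sentence of document A.

First sentence of document B.
```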

tlynn747 commented 4 years ago

Have shared the original .vert file with the team (sharing restrictions apply). Document, sentence and paragraph delimiters are present in the original text. However, it is in a column format, so the text in the first column needs to be extracted. (It is not CoNLL, as the 2nd column is not a lemma!) This might already be available in the first version of the file that Meghan worked on (an intermediate version prior to pre-processing) - we will know when we receive this.
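Assuming the layout described above (tab-separated columns with the word form first, plus <doc>/<p>/<s> markup lines), a minimal extraction sketch could look like the following; the column positions are assumptions, not the project's actual script:

```python
# Hypothetical sketch: extract the surface tokens (first column) from the .vert
# file while keeping the structural markup lines that mark document, paragraph
# and sentence boundaries. Column layout is an assumption.
import sys

def extract_tokens(vert_path):
    with open(vert_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("<") and line.endswith(">"):
                # structural markup: <doc>, <p>, <s> delimiters
                yield line
            elif line:
                # token lines are tab-separated; the first column is the word form
                yield line.split("\t")[0]

if __name__ == "__main__":
    for item in extract_tokens(sys.argv[1]):
        print(item)
```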

The following paper describes the interpretation of "document": https://www.tcd.ie/slscs/assets/documents/staff/nci_nlej.pdf

jowagner commented 4 years ago

Web sources were subject to boilerplate removal and de-duplication, as well as website-level rejection of low-quality websites. No sampling or filtering is mentioned for the other sources. In particular, there was a target to collect 6 million Irish words from non-fiction books, but they ended up including 8.4 million. This suggests they took what they could get with reasonable effort and that the targets only guided where to focus efforts to collect more.

Overall, the NCI is described as having 30.2 million Irish words and 25 million Hiberno-English words. The .vert file has 33.1 million tokens, of which 3.7 million are one of the top 10 punctuation tokens, leaving about 29.4 million word tokens, i.e. about 0.8 million tokens may be missing compared to the reported 30.2 million. It could also be just rounding: the last 5 digits in the breakdown are always either 00000 or 50000. The XCES format used by the project for corpus delivery only partially overlaps with the fields in the .vert file's <doc> tag: some attributes are missing, such as language, biog and targetreaders, and there are new ones, such as origauthor and origtitle. The paper confirms that u stands for unknown. Maybe, when exporting a subcorpus in Sketch Engine, one needs to specify which attributes to export.
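For anyone who wants to re-check these counts, a rough sketch (under the same assumptions about the .vert layout as above, and counting ASCII punctuation only) might be:

```python
# Rough sketch for re-checking the token counts: count token lines in the .vert
# file and tally how many belong to the ten most frequent punctuation-only
# tokens. The file path and column layout are assumptions.
from collections import Counter
import string
import sys

def count_tokens(vert_path):
    total = 0
    punct_counts = Counter()
    with open(vert_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("<"):
                continue  # skip <doc>, <p>, <s> markup lines
            token = line.split("\t")[0].strip()
            if not token:
                continue
            total += 1
            if all(ch in string.punctuation for ch in token):
                punct_counts[token] += 1
    top10 = punct_counts.most_common(10)
    return total, top10, sum(n for _, n in top10)

if __name__ == "__main__":
    total, top10, punct_total = count_tokens(sys.argv[1])
    print(f"total tokens: {total}")
    print(f"top-10 punctuation tokens: {top10} ({punct_total} tokens)")
```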

This comment is based on reading https://link.springer.com/article/10.1007/s10579-006-9011-7 which looks like the final version of the above nci_nlej.pdf.

jowagner commented 4 years ago

Plotting alignments as identified by (a) matching lines after removing whitespace and punctuation, (b) exact matches of triplets of lines, and (c) matching token trigrams, NCI_cleaned is missing at least one large section of the .vert file, about 8%. (There may be more missing sections, but we cannot tell, as Ailbhe's cleaning script is stuck at 64%, having been running for a few days now.) Furthermore, NCI_cleaned seems to have been produced from a wrapped version of the .vert file, i.e. the file was cut roughly in the middle and the two halves were swapped and rejoined. The alignment plot also shows some minor repetitions and sections with a markedly different token trigram distribution.
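For reference, a minimal sketch of alignment method (a), matching lines after stripping whitespace and punctuation; file paths are placeholders and the plotting step is omitted:

```python
# Sketch of alignment method (a): normalise each line by removing whitespace
# and punctuation, then record where lines of file A reappear in file B.
# The resulting (i, j) pairs can be scattered to get the alignment plot.
import re
from collections import defaultdict

def normalise(line):
    # drop whitespace, punctuation and other non-word characters
    return re.sub(r"[\s\W_]+", "", line).lower()

def align(path_a, path_b):
    positions = defaultdict(list)
    with open(path_a, encoding="utf-8") as f:
        for i, line in enumerate(f):
            key = normalise(line)
            if key:
                positions[key].append(i)
    pairs = []  # (line index in A, line index in B) for matching lines
    with open(path_b, encoding="utf-8") as f:
        for j, line in enumerate(f):
            key = normalise(line)
            for i in positions.get(key, []):
                pairs.append((i, j))
    return pairs
```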

Applying Ailbhe's language and sentence length filters to the text I extracted (plus adding line breaks after each sentence-ending punctuation mark), the filter rate is fairly constant over the corpus. (Caveat: as the corpus is big, the resolution of the alignment plot is not sufficient to notice when a section of fewer than 10k sentences is dropped by the filter.) Comparing against NCI_cleaned, the missing sections are visible whether or not the filter is applied, i.e. the missing sections are not due to the language or length filters.
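A hedged sketch of the extra preprocessing step mentioned above (line breaks after sentence-ending punctuation, followed by a sentence-length filter); the actual thresholds used in Ailbhe's filters are not shown here, so the values below are placeholders:

```python
# Sketch: split text at sentence-final punctuation and drop sentences outside
# a length range. The 1-300 token range is a placeholder, not the real setting.
import re

def split_sentences(text):
    # break after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def length_filter(sentences, min_tokens=1, max_tokens=300):
    for sent in sentences:
        n = len(sent.split())
        if min_tokens <= n <= max_tokens:
            yield sent
```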