jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Language filtering for NCI? #12

Open jowagner opened 3 years ago

jowagner commented 3 years ago

Lauren annotated a sample of 1000 <s> segments from the .vert file, i.e. segments not yet split into sentences at sentence-final punctuation. The annotation indicates that about 1% of the NCI is English and about 0.6% is code-switching. A further 1.4% cannot be annotated out of context.
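To get a feel for how much sampling uncertainty a sample of 1000 segments leaves around these percentages, here is a rough sketch using the Wilson score interval. The counts 10, 6 and 14 are assumptions obtained by multiplying the rounded percentages above by 1000; the actual annotation counts may differ slightly.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Assumed counts, derived from the rounded percentages (not the real tallies).
assumed_counts = {"English": 10, "code-switching": 6, "cannot annotate": 14}
for name, count in assumed_counts.items():
    lo, hi = wilson_interval(count, 1000)
    print(f"{name}: {100 * lo:.2f}% to {100 * hi:.2f}%")
```

This treats the 1000 segments as a simple random sample of the corpus, which is itself an assumption.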

If we still want to try applying a language filter, we can choose between Ailbhe's hand-crafted filter and the machine-learning-based filter in our current BERT pipeline. Both could be tested on the sample annotated by Lauren.
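A minimal sketch of how either filter could be scored against the annotated sample. Everything here is hypothetical: the label names (`ga`, `en`, `mixed`, `unknown`), the `keep` callable standing in for a filter, and the choice to skip segments that could not be annotated out of context. Whether code-switched segments should count as Irish is left to the caller via `irish_labels`.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

def evaluate_filter(
    sample: Iterable[Tuple[str, str]],
    keep: Callable[[str], bool],
    irish_labels: Set[str] = frozenset({"ga"}),
    skip_labels: Set[str] = frozenset({"unknown"}),
) -> Dict[str, float]:
    """Score a filter on (text, gold_label) pairs.

    keep(text) returns True if the filter would keep the segment.
    """
    irish_total = irish_kept = other_total = other_removed = skipped = 0
    for text, gold in sample:
        if gold in skip_labels:
            skipped += 1          # not annotatable out of context
            continue
        kept = keep(text)
        if gold in irish_labels:
            irish_total += 1
            irish_kept += kept
        else:
            other_total += 1
            other_removed += not kept
    return {
        # fraction of Irish segments that survive the filter
        "irish_retained": irish_kept / irish_total if irish_total else float("nan"),
        # fraction of non-Irish segments that the filter removes
        "non_irish_removed": other_removed / other_total if other_total else float("nan"),
        "skipped": float(skipped),
    }

# Toy usage with a made-up filter that keeps segments containing "agus".
toy_sample = [
    ("Tá sé anseo agus ansiúd", "ga"),
    ("This is English", "en"),
    ("Dia duit", "ga"),
    ("???", "unknown"),
]
print(evaluate_filter(toy_sample, lambda text: "agus" in text))
```

Reporting both numbers matters here because, with so little English in the corpus, a filter that removes too much Irish data can do more damage than the English it catches.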

jowagner commented 3 years ago

Experiments so far indicate that switching off the language filter does not hurt results and may in fact improve them. (Statistical significance testing is pending.) This probably means either that a small amount of English is not a problem or that filtering removes too much Irish data. We decided in today's meeting not to filter the NCI.

jowagner commented 3 years ago

Re-opening as James wants to try at least one more filter threshold.

jowagner commented 3 years ago

Issue #4 reports: In 3 examples of English sentences found at random locations, the English sentences unexpectedly occur in bursts after Irish sentences in the same document. Maybe the corpus is a snapshot of ongoing translation, with incomplete parts defaulting to the source language or English. If we can confirm this, the information can be used in language classification: when the classifier is not highly confident, it should go with the class of the neighbouring sentences.
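One way the neighbour idea could look, as a sketch only. The assumptions here are mine: the classifier yields a `(label, confidence)` pair per sentence in document order, "neighbouring" means the nearest confidently classified sentence on each side, the 0.9 threshold is arbitrary, and an uncertain sentence is relabelled only when the confident neighbours it has agree with each other.

```python
from typing import List, Optional, Tuple

def _nearest_confident(preds, start: int, step: int, threshold: float) -> Optional[str]:
    """Label of the nearest confident prediction from `start`, moving by `step`."""
    j = start
    while 0 <= j < len(preds):
        if preds[j][1] >= threshold:
            return preds[j][0]
        j += step
    return None

def smooth_labels(preds: List[Tuple[str, float]], threshold: float = 0.9) -> List[str]:
    """Replace low-confidence labels with the label of confident neighbours."""
    smoothed = [label for label, _ in preds]
    for i, (_, conf) in enumerate(preds):
        if conf >= threshold:
            continue  # confident predictions are left unchanged
        left = _nearest_confident(preds, i - 1, -1, threshold)
        right = _nearest_confident(preds, i + 1, +1, threshold)
        neighbours = [n for n in (left, right) if n is not None]
        # Only override when every available confident neighbour agrees.
        if neighbours and all(n == neighbours[0] for n in neighbours):
            smoothed[i] = neighbours[0]
    return smoothed

# Hypothetical predictions for one document.
doc = [("ga", 0.99), ("en", 0.60), ("ga", 0.95), ("en", 0.98), ("ga", 0.55), ("en", 0.97)]
print(smooth_labels(doc))
```

Keeping the original label when the neighbours disagree is a conservative choice at the boundary of a burst, where the sentence could belong to either side.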

jowagner commented 3 years ago

Turns out yesterday's results cannot be used due to the bug identified in issue #39.