jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Enh sentsplit #52

Closed jowagner closed 3 years ago

jowagner commented 3 years ago

Hi James,

This PR implements the new ideas from issue #45, but it changes only about 280 sentence boundary decisions for the NCI; see gdrive ga_BERT > BERT_Preprocessing > NCI-comparisons. Of the 20 changes I checked, 19 were improvements and 1 was not.
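For illustration only, a count like this could be produced roughly as follows. This is a minimal sketch, not the script behind the NCI comparison: it assumes both splitter versions emit one sentence per line over the same text and differ only in whitespace and line breaks, and the names `boundary_offsets`, `changed_boundaries`, `old_split` and `new_split` are hypothetical.

```python
import random


def boundary_offsets(sentences):
    """Return the offsets (counted in non-whitespace characters) at which sentences end."""
    offsets = set()
    position = 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(position)
    return offsets


def changed_boundaries(old_sentences, new_sentences):
    """Offsets where exactly one of the two segmentations places a sentence boundary."""
    return sorted(boundary_offsets(old_sentences) ^ boundary_offsets(new_sentences))


# Toy input (hypothetical): the old splitter breaks after the abbreviation "Dr."
old_split = ["Labhair an Dr.", "Murphy.", "D'imigh Seosamh."]
new_split = ["Labhair an Dr. Murphy.", "D'imigh Seosamh."]

changed = changed_boundaries(old_split, new_split)

# Draw a fixed-seed sample of changed decisions for manual inspection (at most 20).
to_check = random.Random(0).sample(changed, min(20, len(changed)))
```

Counting offsets in non-whitespace characters keeps the comparison independent of how each version normalises spaces around a boundary.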

So while this looks good, it is not worth restarting the experiments that are already running. Ignore this PR for the time being if you are well into the new runs.

Joachim

jbrry commented 3 years ago

This does look better. I can merge it, and it will take effect on the next run if we change something. Right now, though, I have launched the pipeline with wiki-bert-pipeline document filtering and OpusFilter filtering. If the run with OpusFilter does better, we can do more runs with different filtering thresholds and could use these changes then. That would mean changing two things at once, so it might not be good for comparisons. But if we just want to release the best model we can, we know that including this should help, given what we have observed in the example diffs.