jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

NCI: missing boundary between headings and first paragraph #25

Closed jowagner closed 3 years ago

jowagner commented 3 years ago

Issue #4 reports: Looking at the first 100 lines, it seems that all-caps headings and the first sentence of a section are not separated. However, re-doing the sentence splitting without the extra signals from markup in the original documents probably would produce an overall worse segmentation.
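To make the symptom concrete, here is a minimal sketch of how a fused all-caps heading could be detected and split off from the sentence that follows it. The function name `split_allcaps_heading`, the `min_heading_tokens` threshold, and the sample line are all hypothetical and not part of the repository's scripts. This is a surface heuristic only: it does not use the original document markup, and it would likely miss headings that keep a lowercase prefix letter (as in "NA nGAEL") or misfire on sentences that begin with acronyms.

```python
def split_allcaps_heading(line, min_heading_tokens=2):
    """Split a leading run of all-caps tokens from the rest of the line.

    Returns (heading, rest) if the line starts with at least
    `min_heading_tokens` all-caps tokens followed by other text,
    otherwise (None, line) unchanged.
    """
    tokens = line.split()
    n = 0
    for tok in tokens:
        # A token counts as all-caps if it contains a letter and
        # uppercasing it changes nothing (handles Á, É, Í, Ó, Ú too).
        has_letter = any(ch.isalpha() for ch in tok)
        if has_letter and tok == tok.upper():
            n += 1
        else:
            break
    # Too short to be a heading, or the whole line is all-caps
    # (a standalone heading with nothing fused onto it).
    if n < min_heading_tokens or n == len(tokens):
        return None, line
    return " ".join(tokens[:n]), " ".join(tokens[n:])


# Hypothetical example line, not taken from the NCI data.
heading, rest = split_allcaps_heading("AN CHÉAD CHAIBIDIL Bhí an lá go breá.")
print(heading)
print(rest)
```

A check like this could be used to count affected lines before deciding whether any fix is worth the risk of a worse overall segmentation.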

jowagner commented 3 years ago

For BERT this shouldn't matter, as we learned from re-reading the BERT and RoBERTa papers over the last few weeks. However, we need to keep it in mind for other work, e.g. using this data for semi-supervised training of dependency parsers with tri-training.
