Closed jowagner closed 3 years ago
For BERT this shouldn't matter as we learned from re-reading the bert and roberta papers over the last weeks but we need to keep this in mind for other work, e.g. using this data for semi-supervised training of dependency parsers with tri-training.
For BERT this shouldn't matter as we learned from re-reading the bert and roberta papers over the last weeks but we need to keep this in mind for other work, e.g. using this data for semi-supervised training of dependency parsers with tri-training.
Issue #4 reports: Looking at the first 100 lines, it seems that all-caps headings and the first sentence of a section are not separated. However, re-doing the sentence splitting without the extra signals from markup in the original documents probably would produce an overall worse segmentation.