jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Paper: include token counts / corpus stats #103

Closed (jowagner closed this issue 2 years ago)

jowagner commented 2 years ago

The reported sentence counts are not very useful to compare to other corpora as

Suggestion: Also report token counts for current data

Related:

jbrry commented 2 years ago

Table 1 has been updated to show token counts for each corpus as well as the overall total (171.3M tokens).

[ ] Report token counts of de-duplicated data --> now issue #105
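
For reference, here is a minimal sketch of how per-corpus sentence and token counts like those discussed above could be computed. It is not the script used for the paper. It assumes one sentence per line and plain whitespace tokenisation, which may differ from the tokenisation behind the reported figures. The function names `corpus_stats` and `summarise` and the corpus names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_count, token_count) for an iterable of text lines.

    Assumption: one sentence per line, tokens separated by whitespace.
    Blank lines are skipped and do not count as sentences.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        words = line.split()
        if not words:
            continue
        sentences += 1
        tokens += len(words)
    return sentences, tokens


def summarise(corpora: Dict[str, Iterable[str]]) -> Dict[str, Tuple[int, int]]:
    """Compute stats per corpus and add an 'overall' row with the totals."""
    stats = {name: corpus_stats(lines) for name, lines in corpora.items()}
    total_sentences = sum(s for s, _ in stats.values())
    total_tokens = sum(t for _, t in stats.values())
    stats["overall"] = (total_sentences, total_tokens)
    return stats
```

With real data, each value passed to `summarise` could be an open file handle, for example `{"corpus_a": open("corpus_a.txt", encoding="utf-8")}`, where the path is a placeholder. Reporting tokens alongside sentences makes the numbers easier to compare with corpora that use different sentence segmentation.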