jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Use ELRC and OPUS corpora directly #116

Open jowagner opened 1 year ago

jowagner commented 1 year ago

Currently, we use ELRC and OPUS corpora that were pre-processed for SMT, i.e. tokenised and with the first word of each translation unit lowercased (unless it has been identified as a named entity). While BERT is mostly blind to tokenisation, the lowercased sentence-initial words may pose problems in applications such as spell checkers.

--> Replace the ELRC and OPUS corpora with fresh text extracted from the files available from ELRC and OPUS. At least for OPUS, these are easy-to-process XML files.
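
As a rough illustration of the OPUS side, here is a minimal sketch of pulling sentence text out of such an XML file with the Python standard library. It assumes the untokenised variant of the files, with each sentence stored as the text of an `<s>` element; the function name `extract_sentences` and the sample document are hypothetical, and the element layout should be checked against the actual downloads before relying on it.

```python
import io
import xml.etree.ElementTree as ET


def extract_sentences(xml_source):
    """Yield the text of every <s> element, keeping the original casing.

    Assumption: sentences are stored in <s> elements, as in the sample below.
    `xml_source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "s":
            # Join all text inside the element and normalise whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                yield text
            # Drop the element's content to keep memory use low on big files.
            elem.clear()


# Hypothetical sample document, only to show the call.
sample = io.BytesIO(
    (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        "<document>\n"
        '<s id="1">Tá an aimsir go maith inniu.</s>\n'
        '<s id="2">  Dia   duit!  </s>\n'
        "</document>"
    ).encode("utf-8")
)
sentences = list(extract_sentences(sample))
```

Because the text is read as-is, sentence-initial capitalisation is preserved and no SMT-style tokenisation is applied. `iterparse` is used so that large corpus files do not have to be loaded into memory at once.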