Use ELRC and OPUS corpora directly

Currently, we use versions of the ELRC and OPUS corpora that were pre-processed for SMT, i.e. they are tokenised and the first word of each translation unit is lowercased (unless it has been identified as a named entity). While BERT is mostly blind to tokenisation, the lowercased sentence starts may pose problems in applications such as spell checkers.

--> Replace the ELRC and OPUS corpora with fresh text extracted from the files available from ELRC and OPUS. At least for OPUS, these are easy-to-process XML files.
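
A minimal sketch of what the extraction could look like for one OPUS XML file, using only the Python standard library. It assumes that each sentence sits in an `<s>` element and that the untokenised variant of the file is used, so that the original casing and spacing are preserved; this should be checked against the actual downloads. The function name `extract_sentences`, the `sentence_tag` parameter and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical sample; real OPUS files may carry extra attributes or nesting.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<document>
<s id="1">Dia duit, a chara.</s>
<s id="2">Is   teanga
 Cheilteach í an Ghaeilge.</s>
</document>"""


def extract_sentences(source, sentence_tag="s"):
    """Yield the text of each sentence element, one string per sentence.

    `source` is a file name or a binary file object. The tag name "s" is
    an assumption about the corpus format, not something guaranteed here.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == sentence_tag:
            # Collect text from the element and any children, then
            # collapse runs of whitespace (including line breaks).
            text = " ".join("".join(elem.itertext()).split())
            if text:
                yield text
            # Free the memory held by the processed sentence.
            elem.clear()


if __name__ == "__main__":
    for sentence in extract_sentences(BytesIO(SAMPLE.encode("utf-8"))):
        print(sentence)
```

Casing is left untouched, so sentence-initial capitals survive. The ELRC files may use a different format (e.g. TMX) and would need their own reader.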

jbrry / Irish-BERT

Use ELRC and OPUS corpora directly #116