Currently, we use ELRC and OPUS corpora pre-processed for SMT, i.e. tokenised and with the first word of each translation unit lowercased (unless it has been identified as a named entity). While BERT is mostly blind to tokenisation, the lowercased sentence starts may pose problems in applications such as spell checkers.
--> Replace the ELRC and OPUS corpora with fresh text extracted from the files available from ELRC and OPUS. At least for OPUS, these are easy-to-process XML files.
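
The extraction step could look roughly like the sketch below. It is a minimal example that assumes each sentence sits in its own `s` element, which is how the OPUS XML files are understood to be laid out; the function name `iter_sentences` and the sample document are made up for illustration, and the tag name should be checked against the actual files before use.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentences(source, sentence_tag="s"):
    """Yield the text of each sentence element found in an XML stream.

    `sentence_tag` is an assumption about the corpus layout, not a
    confirmed property of every OPUS or ELRC file.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == sentence_tag:
            # itertext() also picks up text nested in child elements;
            # split/join collapses runs of whitespace and newlines.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                yield text
            # Free the memory held by the processed sentence.
            elem.clear()


# Hypothetical sample input, not taken from a real corpus file.
sample = io.BytesIO(
    b'<document><s id="1">The cat sat.</s>'
    b'<s id="2">It <hi>slept</hi> well.</s></document>'
)
sentences = list(iter_sentences(sample))
```

Because the text is read as it appears in the XML, the original casing of sentence starts is kept. If a file wraps every token in its own child element without whitespace in between, the tokens would run together and the joining step would need to be adapted.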