modernmt / DataCollection

Data collection, alignment and TAUS repository
Apache License 2.0

Remove duplicates from per-domain results #17

Closed achimr closed 7 years ago

achimr commented 7 years ago

Because of boilerplate text, the domain-level corpora produced by corpus_by_domain.py can contain many duplicates. There should be a script to de-duplicate this data, to be run optionally at the end of the collection. Earlier in the pipeline, boilerplate text can be useful for improving the quality of the sentence alignment, so de-duplication should happen only as a final step.
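
A minimal sketch of what such a de-duplication step might look like, not the project's actual script. It assumes the per-domain output is plain text with one sentence pair per line (for example source and target separated by a tab); the real output format of corpus_by_domain.py may differ, and the function name `dedupe_lines` is hypothetical.

```python
def dedupe_lines(lines):
    """Yield each distinct line once, preserving first-occurrence order.

    Assumption: each input line holds one complete sentence pair, so an
    exact match on the whole line identifies a duplicate pair.
    """
    seen = set()
    for line in lines:
        key = line.rstrip("\n")  # ignore trailing newline differences
        if key in seen:
            continue  # duplicate pair, e.g. repeated boilerplate
        seen.add(key)
        yield key


# Small in-memory example with one repeated boilerplate pair.
sample = [
    "Contact us\tKontakt\n",
    "Welcome to our site\tWillkommen auf unserer Seite\n",
    "Contact us\tKontakt\n",
]
unique_pairs = list(dedupe_lines(sample))
```

Because the function accepts any iterable of lines, it could be pointed at an open file or at standard input and run as the optional last step, after sentence alignment has already made use of the boilerplate. Keeping every distinct line in a set costs memory proportional to the de-duplicated corpus size; storing a hash digest per line instead would be one way to reduce that for large domains.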