Because of boilerplate text, the domain-level corpora produced by corpus_by_domain.py can contain many duplicates. There should be a script to de-duplicate this data, optionally run at the end of the collection. De-duplication should not happen earlier in the pipeline, because boilerplate text can be useful there to improve the quality of the sentence alignment.
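A minimal sketch of what such a de-duplication step could look like. It assumes the corpus is a text file with one tab-separated source/target sentence pair per line, which may not match the actual output format of corpus_by_domain.py. The function name `dedupe_pairs` is hypothetical and not part of the existing code.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, keeping first-seen order.

    Assumption: every line holds one sentence pair, e.g. "source<TAB>target".
    Only exact duplicates of the whole pair are dropped.
    """
    seen = set()
    for line in lines:
        pair = line.rstrip("\n")
        if not pair:
            continue  # skip empty lines
        # Store a fixed-size digest instead of the full text to limit memory use.
        digest = hashlib.sha1(pair.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield pair
```

Since this only removes exact duplicates of a full pair, a boilerplate sentence that was aligned to different translations is kept. Running it as a separate last step leaves the boilerplate available to the sentence aligner earlier on.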