Closed svenha closed 6 years ago
Personally, I don't think multilingual texts are a big issue (modern German texts tend to contain many English expressions anyway). In fact, I am considering building multilingual (at least English+German) models soon.
However, if you are looking for a clean way to filter out files and/or sentences, I guess we'd have to consider some sort of (generic) filtering infrastructure. Maybe plain text files listing prompts or filenames to exclude would help (we'd probably want some sort of globbing/regular-expression support in there as well).
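A minimal sketch of such an exclude-file mechanism, assuming one glob pattern per line with `#` comments allowed; the file format and function names here are assumptions for illustration, not an existing project API:

```python
import fnmatch


def load_exclude_patterns(path):
    """Read one glob pattern per line; skip blank lines and '#' comments."""
    patterns = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                patterns.append(line)
    return patterns


def is_excluded(filename, patterns):
    """Return True if filename matches any exclude pattern."""
    return any(fnmatch.fnmatch(filename, p) for p in patterns)
```

A corpus script could then call `is_excluded()` on each candidate `.sgm` file before processing it, so excluding a text is just one line in a plain-text file.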
Yes, it's not a big issue; currently my method removes only 1.5 % of parole_de.txt.
A multilingual model is an interesting idea and could be compared to an approach that feeds each input into the English system and the German system (in parallel) and lets the scores decide which result to prefer.
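That parallel approach could be sketched as follows, assuming each decoder returns a (transcript, confidence) pair; the decoder interface is hypothetical, not a real API:

```python
def best_transcript(audio, decoders):
    """Run each decoder (e.g. an English and a German system) on the
    same audio and keep the hypothesis with the highest score.

    `decoders` maps a language tag to a callable returning a
    (transcript, score) tuple -- an assumed interface for this sketch.
    """
    results = {lang: decode(audio) for lang, decode in decoders.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    return best_lang, results[best_lang][0]
```

Comparing scores across two independently trained systems is only meaningful if the scores are calibrated to a common scale, which is one argument in favor of a single multilingual model instead.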
ok, I will close this issue for now, then.
The file `train_all.txt` generated during the construction of the German language model contains many English sentences. Some boilerplate sentences can be removed from `data/dst/text-corpora/` with a tiny `grep` call, but I would like to exclude whole texts from parole_de because they are multilingual, e.g. `geheimdienst.sgm`. My current solution is to rename such files so that their suffix is no longer `.sgm` (that's what the script is looking for). Is there a cleaner approach, like an exclude file?
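The renaming workaround described above could be scripted like this; the directory layout and the `.excluded` suffix are assumptions for illustration:

```python
import os


def exclude_sgm_files(corpus_dir, filenames):
    """Rename the listed .sgm files so a corpus script that globs
    for *.sgm will skip them. The '.excluded' suffix is an arbitrary
    choice; reversing the rename restores the file."""
    for name in filenames:
        src = os.path.join(corpus_dir, name)
        if os.path.exists(src):
            os.rename(src, src + '.excluded')
```

This keeps the excluded texts in place and easy to restore, but an exclude file read by the corpus script itself would avoid touching the source data at all.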