Closed svenha closed 6 years ago
Personally, I don't think multilingual texts are a big issue (modern German texts tend to contain many English expressions anyway). In fact, I am considering building multilingual (at least English+German) models soon.
However, if you are looking for a clean way to filter out files and/or sentences, I guess we'd have to consider some sort of (generic) filtering infrastructure. Maybe plain text files listing prompts or filenames to exclude would help (we'd probably want some sort of globbing/regular-expression support in there as well).
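A minimal sketch of such an exclude-file mechanism, assuming one glob pattern per line with `#` comments allowed; the file format and function names here are assumptions for illustration, not an existing project API:

```python
import fnmatch


def load_exclude_patterns(path):
    """Read one glob pattern per line; skip blank lines and '#' comments."""
    patterns = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                patterns.append(line)
    return patterns


def is_excluded(filename, patterns):
    """Return True if filename matches any exclude pattern."""
    return any(fnmatch.fnmatch(filename, p) for p in patterns)
```

A corpus script could then call `is_excluded()` on each candidate `.sgm` file before processing it, so excluding a text is just one line in a plain-text file.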
Yes, it's not a big issue; currently my method removes only 1.5 % of parole_de.txt.
A multilingual model is an interesting idea and could be compared to an approach that feeds each input into the English system and the German system (in parallel) and lets the scores decide which result to prefer.
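That parallel approach could be sketched as follows, assuming each decoder returns a (transcript, confidence) pair; the decoder interface is hypothetical, not a real API:

```python
def best_transcript(audio, decoders):
    """Run each decoder (e.g. an English and a German system) on the
    same audio and keep the hypothesis with the highest score.

    `decoders` maps a language tag to a callable returning a
    (transcript, score) tuple -- an assumed interface for this sketch.
    """
    results = {lang: decode(audio) for lang, decode in decoders.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    return best_lang, results[best_lang][0]
```

Comparing scores across two independently trained systems is only meaningful if the scores are calibrated to a common scale, which is one argument in favor of a single multilingual model instead.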
ok, I will close this issue for now, then.
The file `train_all.txt` generated during the construction of the German language model contains many English sentences. Some boilerplate sentences can be removed from `data/dst/text-corpora/` with a tiny `grep` call, but I would like to exclude whole texts from parole_de because they are multilingual, e.g. `geheimdienst.sgm`. My current solution is to rename such files so that their suffix is no longer `.sgm` (that's what the script is looking for). Is there a cleaner approach, like an exclude file?
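The renaming workaround described above could be scripted like this; the directory layout and the `.excluded` suffix are assumptions for illustration:

```python
import os


def exclude_sgm_files(corpus_dir, filenames):
    """Rename the listed .sgm files so a corpus script that globs
    for *.sgm will skip them. The '.excluded' suffix is an arbitrary
    choice; reversing the rename restores the file."""
    for name in filenames:
        src = os.path.join(corpus_dir, name)
        if os.path.exists(src):
            os.rename(src, src + '.excluded')
```

This keeps the excluded texts in place and easy to restore, but an exclude file read by the corpus script itself would avoid touching the source data at all.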