harvard-edge / multilingual_kws

Few-shot Keyword Spotting in Any Language and Multilingual Spoken Word Corpus

Filter out NaNs from Common Voice tsvs, distinguish between intentional "nan" in language vocabulary #9

Open mmaz opened 3 years ago

mmaz commented 3 years ago

In German, 'null' (zero) is converted to NaN by pandas when it is the only word present in the transcript (as happens in the single-word-target-segments data), because pandas treats the string "null" as a missing value by default.

One option is to pass na_filter=False to read_csv when reading Common Voice TSVs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

However, we should first check for truly missing values in the sentence transcription column, since with NA detection disabled an empty transcript is read as an empty string rather than NaN.
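
A minimal sketch of how both steps might fit together, not the repo's actual loader. The sample TSV, the helper name `read_transcripts`, and the column names `path` and `sentence` are assumptions made for illustration. It reads every field as a string with NA detection turned off, then drops rows whose transcript is empty or whitespace-only, so 'null' and 'nan' survive as ordinary words.

```python
import io

import pandas as pd

# Hypothetical miniature of a Common Voice TSV (real files have more columns).
SAMPLE_TSV = (
    "path\tsentence\n"
    "a.mp3\tnull\n"       # German for "zero": a real word
    "b.mp3\tnan\n"        # an intentional "nan" in the vocabulary
    "c.mp3\t\n"           # a truly missing transcript
    "d.mp3\teins zwei\n"
)


def read_transcripts(source):
    """Read a TSV without NA conversion, then drop empty transcripts.

    `read_transcripts` is an illustrative helper, not part of multilingual_kws.
    """
    # na_filter=False stops pandas from mapping "null", "nan", "NA", ... to NaN;
    # dtype=str keeps every field as the literal text found in the file.
    df = pd.read_csv(source, sep="\t", dtype=str, na_filter=False)
    # With NA detection off, a missing transcript shows up as an empty string.
    missing = df["sentence"].str.strip() == ""
    return df[~missing].reset_index(drop=True)


# Default parsing, for comparison: "null", "nan", and the empty field
# all end up as NaN in the sentence column.
default = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

cleaned = read_transcripts(io.StringIO(SAMPLE_TSV))
```

Here `default["sentence"].isna().sum()` is 3, while `cleaned` keeps the rows for 'null', 'nan', and 'eins zwei' and drops only the row with no transcript.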