Downloaded sets different than on published RESULTS

goodatlas / zeroth

Kaldi-based Korean ASR (한국어 음성인식) open-source project

Apache License 2.0

348 stars 124 forks source link

Downloaded sets different than on published RESULTS #8

Closed feddybear closed 5 years ago

feddybear commented 5 years ago

NOTE: I'm referring to the RESULTS file on the current Kaldi commit, not goodatleas/zeroth

Hi, I tried running the provided recipes for zeroth_korean on kaldi. I didn't change anything on the scripts but when I looked at the evaluations, I found that the test set is only giving me 6641 "words" (individual tokens separated by space). The RESULTS text file however, shows that there are 9253 words. What do you think is the issue why there is a discrepancy with the data I'm processing?

Another insight on my current setup, when I reached nnet3's common procedures for ivector extraction, I first tried turning off speed perturbation. Once it reaches the point where the training data is being subset to 30,000 utterances, it gave an error because the training set (train_clean) is only 22263 files. Maybe there's something I'm missing?

jty016 commented 5 years ago

Hi, Korean is agglutinative language, so all the text should be analyzed by subword (morpheme). As a raw testset text file has 6641 word, but after segmented by local/updateSegmentation.sh, the number of morphemes will be around 92XX.

If you are using data from openslr, training data is 22263 files. To get 30k subset, you should do speed perturbation. the training set will be tripled and you can get 30k of perturbed data.

feddybear commented 5 years ago

Hi Lucas,

Is there any reason why the segmentation procedure is not part of the current "run" scripts' pipelines? As I mentioned, I didn't change anything in the scripts except for some flags (e.g. speed_perturb). Should it not be part of the score scripts? Maybe I'm missing a flag to do the tokenization somewhere?

Regarding the speed perturbation, yes I understand that with speed perturbation I can get 3x more. But I was just wondering about having the subsetting to 30,000 step there even when perturbation is turned off, that maybe there's a larger dataset I'm missing.

feddybear commented 5 years ago

Oh, it's my bad. I was running the run script stages on a per copy-and-paste basis and I didn't realize I was skipping the update segmentation stage. Stupid mistake! Sorry for the trouble!