kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

VoxCeleb1 dataset folder structure not compatible with voxceleb recipe #2759

Open kukas opened 6 years ago

kukas commented 6 years ago

I tried to run voxceleb recipe but I encountered a problem with the VoxCeleb1 dataset. The folder structure of the dataset downloaded from the official site is not the same as the one used in the recipe.

This information is not in README.txt, I had to retrieve the original folder structure by going through the recipe code. See the outdated script.

A workaround is to merge the dev and test parts of VoxCeleb1 and then flatten the internal tree structure by one level. This can be done by running the following bash script:

# Merging the datasets:
mkdir -p voxceleb1_wav
cp -r vox1_dev_wav/wav/* voxceleb1_wav/
cp -r vox1_test_wav/wav/* voxceleb1_wav/

# Flattening the structure: <speaker>/<video>/<utt>.wav becomes <speaker>/<video>_<utt>.wav
# (the path is passed to sh as $1 and quoted, so unusual file names are handled safely)
cd voxceleb1_wav/
find . -mindepth 3 -maxdepth 3 -name "*.wav" -exec sh -c 'f=$1; wav_name=$(basename "$f"); video_clip_id=$(basename "${f%/*}"); speaker_id=${f%/*/*}; mv "$f" "${speaker_id}/${video_clip_id}_${wav_name}"' sh {} \;

# Removing the now-empty per-video subdirectories
find . -mindepth 2 -maxdepth 2 -type d -exec rm -r {} \;

danpovey commented 6 years ago

I think they changed it, but I thought we had already fixed the scripts. @david-ryan-snyder, any comment?

david-ryan-snyder commented 6 years ago

The people who distribute VoxCeleb have modified the organization and labels of the dataset a few times. We made an attempt to make the recipe backward compatible last month. If you're using the most recent version of Kaldi and it still isn't working, perhaps they've modified the dataset again.

It's going to be cumbersome to modify make_voxceleb1.pl once again to keep it working with multiple versions.

@kukas, my suggestion is to create a pull request with a new script, let's call it make_voxceleb1_v2.pl, that is a modified version of make_voxceleb1.pl and works with the latest version of VoxCeleb1 that you downloaded. Most likely the only changes you need to make are to the path names (e.g., apparently the subdirectory voxceleb1_wav doesn't exist in the newest version). In the run.sh script, the old version can be commented out. There should be a comment next to it (in run.sh) explaining that if you're using an older version of the dataset, you should try preparing your data with make_voxceleb1.pl rather than make_voxceleb1_v2.pl.

If you can't do this, I'll try to get to it myself when I have time.

danpovey commented 6 years ago

Perhaps we could ask them to try to keep it more stable?

david-ryan-snyder commented 6 years ago

I can, once I download the current version again and verify that things have actually changed since the last time I tried to patch this.

jeremyholleman commented 5 years ago

I just downloaded Kaldi and VoxCeleb1 and VoxCeleb2 over the last week and I'm seeing a similar problem. I'm currently trying to run voxceleb v1 speaker ID and it seems to be expecting a different structure to the data. Have others seen this recently?

I finally got past stage 0 by commenting out lines related to the VC1 database, though I'm still fighting issues that may be related to that.

I recognize that it may be impossible to track a moving target if the VoxCeleb hosts keep changing structure, but maybe make_voxceleb1.pl could include a description of the file structure it expects? Thanks, Jeremy
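Until the script documents its expected layout, a quick check can at least tell the two layouts discussed in this thread apart. The sketch below is only an illustration: detect_vox1_layout is a hypothetical helper, not part of Kaldi, and the two layouts are inferred from the workaround script at the top of this issue (wav files directly under each speaker directory in the flattened layout, one per-video directory deeper in the downloaded one).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): guess which VoxCeleb1 layout a
# directory uses from how deep the .wav files sit below it.
#   flattened layout: <root>/<speaker>/<video>_<utt>.wav   (depth 2)
#   nested layout:    <root>/<speaker>/<video>/<utt>.wav   (depth 3)
detect_vox1_layout() {
  local root=$1
  if [ -n "$(find "$root" -mindepth 3 -maxdepth 3 -name '*.wav' | head -n 1)" ]; then
    echo nested
  elif [ -n "$(find "$root" -mindepth 2 -maxdepth 2 -name '*.wav' | head -n 1)" ]; then
    echo flattened
  else
    echo unknown
  fi
}
```

For example, `detect_vox1_layout vox1_dev_wav/wav` could be run before stage 0 to decide which data preparation script is likely to apply.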

david-ryan-snyder commented 5 years ago

Hi @jeremyholleman, you can probably pull out the dataprep scripts from https://github.com/kaldi-asr/kaldi/pull/2983/files. We didn't merge this PR because they were adding other things that we didn't want to merge, and then I forgot about it.

If you find those new scripts work for you, please let me know on this thread.

jeremyholleman commented 5 years ago

@david-ryan-snyder Thanks! That seemed to help. It got me through stage 1 anyway. I'm still working on getting stage 3 to complete on my laptop without crashing, but I see no indication that the dataset structure is still causing problems.

One caveat: I had already mucked with the file structure of the VoxCeleb1 dataset and I manually unmucked to work with run_new.sh et al. So if run_new does not perfectly cohere with the VoxCeleb1 structure, my mucking could have masked that. But I think that my current structure matches the original structure.

danpovey commented 5 years ago

jeremy: if you have time to select from that other PR just the changes we wanted (David indicated in comments) and make a 'cleaned' version for us to merge it would be great. I think the original authors did not have time or did not want to do it.

jeremyholleman commented 5 years ago

I can once I have it working. Right now it errors out in stage 6, pointing to voxceleb1_test_scoring.log. I've pasted below a few relevant lines from that. It is failing to find any of the training vectors in a hash during the plda scoring. I did also add a hook in make_voxceleb1_v2.pl to stop after 300 training speakers so that the training time during this debug process would be tolerable. I'm not sure whether the plda-scoring problem is due to my mods to the process or an unaddressed mismatch with the new dataset structure. I'll update once I figure it out.

Jeremy

WARNING (ivector-plda-scoring[5.5.206~1-abfbc5]:main():ivector-plda-scoring.cc:170) Key -0cYFdtyWVds-00005 not present in training iVectors.
WARNING (ivector-plda-scoring[5.5.206~1-abfbc5]:main():ivector-plda-scoring.cc:170) Key -0cYFdtyWVds-00005 not present in training iVectors.
LOG (ivector-plda-scoring[5.5.206~1-abfbc5]:main():ivector-plda-scoring.cc:217) Processed 0 trials, 37720 had errors.

jeremyholleman commented 5 years ago

I am still seeing this problem. It works through stage 5 and then fails in stage 6. I think that it is due to a mismatch on the anonymized vs non-anonymized formats between the code and the dataset.

This thread seems to be addressing the same problem. So in addition to using the run_new.sh and make_voxceleb1_v2.pl scripts, I also changed make_voxceleb1_v2.pl as follows:

# original
system("wget -O $data_base/voxceleb1_test.txt http://www.openslr.org/resources/49/voxceleb1_test.txt");
# modified
system("wget -O $data_base/veri_test.txt http://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test.txt");

and replaced voxceleb1_test.txt with veri_test.txt in the open(TRIAL_IN ... line. The two files look like they use the same format, so I did not modify veri_test.txt. If anyone has any suggestions about what else needs to be changed to get this to run, that would be helpful.
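One way to narrow this down before running the scoring stage is to compare the IDs in the generated trials file against the utterance IDs that data preparation produced. The following is a rough sketch, not something from the recipe: missing_trial_keys is a hypothetical helper, and it assumes the usual Kaldi formats, i.e. trials lines of the form "<utt1> <utt2> <target|nontarget>" and utt2spk lines of the form "<utt> <spk>".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): print every ID used in a trials
# file that has no entry in utt2spk.
# Assumed formats: trials  = "<utt1> <utt2> <label>"
#                  utt2spk = "<utt> <spk>"
missing_trial_keys() {
  local trials=$1 utt2spk=$2
  awk 'FILENAME == ARGV[1] { known[$1] = 1; next }   # first file: utt2spk
       { for (i = 1; i <= 2; i++) if (!($i in known)) print $i }' \
      "$utt2spk" "$trials" | sort -u
}
```

If something like `missing_trial_keys data/voxceleb1_test/trials data/voxceleb1_test/utt2spk` (paths assumed here) prints anything, the trial list and the prepared data disagree on utterance IDs, which would be consistent with the "not present in training iVectors" warnings.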

Thanks Jeremy

david-ryan-snyder commented 5 years ago

@jeremyholleman the voxceleb directory on openslr (http://www.openslr.org/49/) also contains a metadata file that provides a mapping between the two different speaker ID formats.

Do you think it's possible to correct this issue by simply applying this map to either the utterances, or to the trials file?

jeremyholleman commented 5 years ago

@david-ryan-snyder I'm not sure. How would one do that? Are you talking about vox1_meta.csv? Its lines are of the form: id10035 Alfre_Woodard f USA dev, presumably indicating that speaker id10035 is Alfre Woodard. I think the missing keys may actually be utterance IDs rather than speaker IDs. For example, take the warning quoted above: WARNING (ivector-plda-scoring[5.5.206~1-abfbc5]:main():ivector-plda-scoring.cc:170) Key -0cYFdtyWVds-00005 not present in training iVectors. 0cYFdtyWVds exists as a subdirectory in the test set under speaker id10309 (Ezra Miller, FWIW).

>find . -name "*0cYFdtyWVds*"
./test/wav/id10309/0cYFdtyWVds

david-ryan-snyder commented 5 years ago

-0cYFdtyWVds-00005

It looks to me like the speaker ID was supposed to be a prefix of this utterance, but the data prep script failed somewhere.
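A quick way to test that hypothesis is to scan utt2spk for utterance IDs that do not begin with their speaker ID. This is only a sketch: utts_without_speaker_prefix is a hypothetical helper, and it assumes the standard "<utt> <spk>" line format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): print utt2spk lines whose
# utterance ID does not start with the speaker ID, including lines where
# the speaker field is missing altogether.
utts_without_speaker_prefix() {
  awk 'NF < 2 || index($1, $2) != 1 { print }' "$1"
}
```

For example, `utts_without_speaker_prefix data/voxceleb1_test/utt2spk` (path assumed) should print nothing if the speaker prefix was attached correctly to every utterance.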

How would one do that?

As you noticed, the first two columns provide a mapping between the two different representations of the speaker's name. Maybe it will come in handy.
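As a rough illustration of how that mapping might be applied, the sketch below builds a name-to-ID table from the first two columns and rewrites name-based entries in a text file. Several things here are assumptions rather than facts from the recipe: both function names are hypothetical, the metadata lines are taken to be whitespace-separated in the form quoted above (id10035 Alfre_Woodard f USA dev), and the file to rewrite is assumed to contain entries of the form <name>/<rest>.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Kaldi).

# Print "<name> <id>" pairs from a vox1_meta.csv-style file. Lines whose
# first field does not look like an idNNNNN speaker ID (e.g. a header
# line) are skipped.
name_to_id_map() {
  awk '$1 ~ /^id[0-9]+$/ { print $2, $1 }' "$1"
}

# Rewrite every whitespace-separated field of the form <name>/<rest> to
# <id>/<rest> when <name> appears in the map; other fields are left alone.
map_speaker_names() {
  local map=$1 input=$2
  awk 'FILENAME == ARGV[1] { id[$1] = $2; next }   # first file: the map
       {
         for (i = 1; i <= NF; i++) {
           slash = index($i, "/")
           if (slash > 1) {
             name = substr($i, 1, slash - 1)
             if (name in id) $i = id[name] substr($i, slash)
           }
         }
         print
       }' "$map" "$input"
}
```

A possible use would be `name_to_id_map vox1_meta.csv > name2id` followed by `map_speaker_names name2id voxceleb1_test.txt`, after which the output should be inspected by hand, since whether the rest of each path also needs converting depends on the dataset version.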

david-ryan-snyder commented 5 years ago

@SoumiaJaayfer, someone made a pull request recently to fix this issue with voxceleb1. Probably your git repo is out of date.

Use this run.sh: https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2/run.sh. There's no run_new.sh script in that directory. Maybe it's a file you created.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.