kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

VoxCeleb1 dataset folder structure not compatible with SITW recipe #4220

Open mmmmayi opened 4 years ago

mmmmayi commented 4 years ago

Hello Daniel,

I tried to run the SITW/v2 recipe with the latest VoxCeleb dataset, but run.sh fails with this error: Cannot open directory: No such file or directory at local/make_voxceleb1.pl line 56. I think it is caused by line 40 of run.sh, which seems to expect the path 'VoxCeleb1/voxceleb1_wav' to exist, whereas I only have 'VoxCeleb1/dev' and 'VoxCeleb1/test'. I think this is because VoxCeleb has been updated. Could you please help me update the script? Thank you.

danpovey commented 4 years ago

I dunno. I know changes in the VoxCeleb data have been an issue, but I don't know if your data is the old or the new one, or if this recipe is supposed to be up to date.

mmmmayi commented 4 years ago

I used the latest version. It works well with make_voxceleb1_v2.pl in voxceleb/v2

danpovey commented 4 years ago

It would be great if you could figure out how to resolve the issue using that other data-prep script and make a PR so it works for the current voxceleb but can be made to work for the older release via a commented-out command in the run.sh.
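One way run.sh could tell the two releases apart is to look at the directory layout. The following is only a sketch under assumptions: the function name detect_voxceleb1_layout is hypothetical, and the directory names are the ones given in the original report ('voxceleb1_wav' for the older release, 'dev' and 'test' for the current one).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Kaldi: report which VoxCeleb1 layout
# a directory uses.
# Prints "old" if <root>/voxceleb1_wav exists, "new" if both <root>/dev
# and <root>/test exist, and "unknown" (with a non-zero status) otherwise.
detect_voxceleb1_layout() {
  local root=$1
  if [ -d "$root/voxceleb1_wav" ]; then
    echo old
  elif [ -d "$root/dev" ] && [ -d "$root/test" ]; then
    echo new
  else
    echo unknown
    return 1
  fi
}
```

run.sh could then branch on the printed value to choose between the existing data-prep script and the newer one, instead of relying on a commented-out command.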

mmmmayi commented 4 years ago

I wrote a make_voxceleb1_v2.pl for sitw, based on the one in egs/voxceleb:

if (@ARGV != 2) {
  print STDERR "Usage: $0 <path-to-voxceleb1> <path-to-data-dir>\n";
  print STDERR "e.g. $0 /export/voxceleb1 data/\n";
  exit(1);
}

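# Keep the command-line argument and the derived output directory in
# separate variables instead of redeclaring $out_dir.
my ($data_base, $out_root) = @ARGV;
my $out_dir = "$out_root/voxceleb1";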

if (system("mkdir -p $out_dir") != 0) {
  die "Error making directory $out_dir";
}

# This file provides the list of speakers that overlap between SITW and VoxCeleb1.
if (! -e "$out_dir/voxceleb1_sitw_overlap.txt") {
  system("wget -O $out_dir/voxceleb1_sitw_overlap.txt http://www.openslr.org/resources/49/voxceleb1_sitw_overlap.txt");
}

if (! -e "$data_base/vox1_meta.csv") {
  system("wget -O $data_base/vox1_meta.csv http://www.openslr.org/resources/49/vox1_meta.csv");
}

# sitw_overlap contains the list of speakers that also exist in our evaluation set, SITW.
my %sitw_overlap = ();
open(OVERLAP, "<", "$out_dir/voxceleb1_sitw_overlap.txt") or die "Could not open the overlap file $out_dir/voxceleb1_sitw_overlap.txt";
while (<OVERLAP>) {
  chomp;
  my $spkr_id = $_;
  $sitw_overlap{$spkr_id} = ();
}
close(OVERLAP) or die;

open(META_IN, "<", "$data_base/vox1_meta.csv") or die "Could not open the meta data file $data_base/vox1_meta.csv";

# Also add the banned speakers to sitw_overlap using their ID format in the
# newest version of VoxCeleb.
while (<META_IN>) {
  chomp;
  my ($vox_id, $spkr_id, $gender, $nation, $set) = split;
  if (exists($sitw_overlap{$spkr_id})) {
    $sitw_overlap{$vox_id} = ();
  }
}
close(META_IN) or die;

opendir my $dh, "$data_base/wav" or die "Cannot open directory $data_base/wav: $!";
my @spkr_dirs = grep {-d "$data_base/wav/$_" && ! /^\.{1,2}$/} readdir($dh);
closedir $dh;

open(SPKR, ">", "$out_dir/utt2spk") or die "Could not open the output file $out_dir/utt2spk";
open(WAV, ">", "$out_dir/wav.scp") or die "Could not open the output file $out_dir/wav.scp";

foreach (@spkr_dirs) {
  my $spkr_id = $_;
  if (not exists $sitw_overlap{$spkr_id}) {
    opendir my $dh, "$data_base/wav/$spkr_id/" or die "Cannot open directory $data_base/wav/$spkr_id/: $!";
    my @rec_dirs = grep {-d "$data_base/wav/$spkr_id/$_" && ! /^\.{1,2}$/} readdir($dh);
    closedir $dh;
    foreach (@rec_dirs) {
      my $rec_id = $_;
      opendir my $dh, "$data_base/wav/$spkr_id/$rec_id/" or die "Cannot open directory $data_base/wav/$spkr_id/$rec_id/: $!";
      my @files = map{s/\.[^.]+$//;$_}grep {/\.wav$/} readdir($dh);
      closedir $dh;
      foreach (@files) {
        my $name = $_;
        my $wav = "$data_base/wav/$spkr_id/$rec_id/$name.wav";
        my $utt_id = "$spkr_id-$rec_id-$name";
        print WAV "$utt_id", " $wav", "\n";
        print SPKR "$utt_id", " $spkr_id", "\n";
      }
    }
  }
}

close(SPKR) or die;
close(WAV) or die;

if (system(
  "utils/utt2spk_to_spk2utt.pl $out_dir/utt2spk >$out_dir/spk2utt") != 0) {
  die "Error creating spk2utt file in directory $out_dir";
}
system("env LC_COLLATE=C utils/fix_data_dir.sh $out_dir");
if (system("env LC_COLLATE=C utils/validate_data_dir.sh --no-text --no-feats $out_dir") != 0) {
  die "Error validating directory $out_dir";
}

Before running this script, you need to merge all of the samples in dev/ and test/ into a single folder called wav, because the script reads the speaker directories from $data_base/wav.
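The merge step described above could be scripted roughly as follows. This is a minimal sketch, not part of the recipe: the function name merge_into_wav is hypothetical, and it assumes the speaker directories sit directly under whichever source directories are passed in, so the source paths may need adjusting to match the local copy. It creates symbolic links rather than copies, so no audio is duplicated.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Kaldi: link every speaker directory
# found in the given source directories into one destination directory.
# Usage: merge_into_wav <dest-dir> <src-dir> [<src-dir> ...]
merge_into_wav() {
  local dest=$1
  shift
  mkdir -p "$dest" || return 1
  local src spk_dir
  for src in "$@"; do
    for spk_dir in "$src"/*/; do
      # Skip the unexpanded glob when a source directory is empty or missing.
      [ -d "$spk_dir" ] || continue
      # -s: symbolic link, -f: replace an existing link, -n: do not follow
      # an existing link that already points at a directory.
      ln -sfn "$(cd "$spk_dir" && pwd)" "$dest/$(basename "$spk_dir")"
    done
  done
}

# Example call (the paths are assumptions about the local layout):
# merge_into_wav /export/VoxCeleb1/wav /export/VoxCeleb1/dev /export/VoxCeleb1/test
```

If the same speaker ID occurs in more than one source directory, the later link replaces the earlier one, so this sketch only suits a split in which dev/ and test/ hold different speakers.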

danpovey commented 4 years ago

mm. is there any way you could make a PR from it? better if it's fully automatic.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.