Closed gvoysey closed 7 years ago
@gvoysey The importer expects a flac file and then converts the flac file to wav.
The conversion happens here: librivox.py#L223
def _maybe_convert_wav(data_dir, extracted_data, converted_data):
    source_dir = os.path.join(data_dir, extracted_data)
    target_dir = os.path.join(data_dir, converted_data)
    # Conditionally convert FLAC files to wav files
    if not gfile.Exists(target_dir):
        # Create target_dir
        os.makedirs(target_dir)
        # Loop over FLAC files in source_dir and convert each to wav
        for root, dirnames, filenames in os.walk(source_dir):
            for filename in fnmatch.filter(filenames, '*.flac'):
                flac_file = os.path.join(root, filename)
                wav_filename = os.path.splitext(os.path.basename(flac_file))[0] + ".wav"
                wav_file = os.path.join(target_dir, wav_filename)
                transformer = Transformer()
                transformer.build(flac_file, wav_file)
                os.remove(flac_file)
My follow-on question then is: are any flac files converted to wav?
I'm wondering if the sox module has the correct supporting libraries on your system. In theory, such support should be checked here: librivox.py#L123. But we may be seeing some strange failure mode on your system that we haven't encountered before.
@kdavis-mozilla I'm running SoX version 14.14.1. sox --help lists:
AUDIO FILE FORMATS: 8svx aif aifc aiff aiffc al amb amr-nb amr-wb anb au avr awb caf cdda cdr cvs cvsd cvu dat dvms f32 f4 f64 f8 fap flac fssd gsm gsrt hcom htk ima ircam la lpc lpc10 lu mat mat4 mat5 maud nist ogg paf prc pvf raw s1 s16 s2 s24 s3 s32 s4 s8 sb sd2 sds sf sl sln smp snd sndfile sndr sndt sou sox sph sw txw u1 u16 u2 u24 u3 u32 u4 u8 ub ul uw vms voc vorbis vox w64 wav wavpcm wv wve xa xi
PLAYLIST FORMATS: m3u pls
AUDIO DEVICE DRIVERS: alsa
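The format check above can also be scripted. This is only a rough sketch, not the check the importer performs at librivox.py#L123: it assumes that sox --help prints an "AUDIO FILE FORMATS:" line like the one quoted above, and the helper names parse_sox_formats and sox_supports are made up for illustration.

```python
import subprocess


def parse_sox_formats(help_text):
    """Return the set of formats listed on the 'AUDIO FILE FORMATS:' line."""
    prefix = "AUDIO FILE FORMATS:"
    for line in help_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return set(line[len(prefix):].split())
    # No such line found: report no known formats rather than guessing.
    return set()


def sox_supports(fmt, sox_bin="sox"):
    """Hypothetical helper: ask the local sox binary whether it lists `fmt`.

    Raises FileNotFoundError if `sox_bin` is not installed.
    """
    result = subprocess.run(
        [sox_bin, "--help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: the usage text may land on either stream, so search both.
    return fmt in parse_sox_formats(result.stdout + "\n" + result.stderr)
```

Calling sox_supports("flac") and sox_supports("wav") should both return True on a system whose sox build lists those formats.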
From the root LibriSpeech directory, find . -name "*.wav" | wc -l returns 236,644 wav files. A similar search for *.flac returns 148,688 files. I don't quite have the bash chops to compare the file names and see whether they're disjoint sets, though, so I don't know whether this tells you that some but not all of the conversion has happened.
I have the original tarballs, so I suppose I could nuke LibriSpeech and untar everything myself to start over. Let me know if you think that's worth trying.
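For the disjoint-set question, a short Python script can stand in for the bash. This is a minimal sketch that assumes file names such as 3547-8365-0023 are unique across the whole LibriSpeech tree; the function names are hypothetical.

```python
import os


def stems_by_extension(root_dir, extension):
    """Collect file names (without extension) for every file ending in `extension`."""
    stems = set()
    for _root, _dirs, filenames in os.walk(root_dir):
        for filename in filenames:
            stem, ext = os.path.splitext(filename)
            if ext.lower() == extension:
                stems.add(stem)
    return stems


def compare_conversion(root_dir):
    """Split utterances into: has both files, flac only, wav only."""
    flac = stems_by_extension(root_dir, ".flac")
    wav = stems_by_extension(root_dir, ".wav")
    return {
        "both": flac & wav,       # non-empty means the two sets are not disjoint
        "flac_only": flac - wav,  # not converted yet
        "wav_only": wav - flac,   # converted, flac already removed
    }
```

Since the importer removes each flac file right after converting it, a large "flac_only" set together with an existing target directory would suggest a conversion that stopped partway.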
The plot thickens.
librivox initially only contains flac files, so any wav files are the result of a sox conversion. Thus conversion is happening. But for some reason it just stops.
I guess there are at least two options:
To just try and get things running, I'd first try to re-run the code without nuking LibriSpeech. If that doesn't work, I'd nuke LibriSpeech, then run again.
Fixing this bug requires a bit more detective work. I'd suggest trying to convert the problematic flac file to wav by hand, i.e. on the command line. I think it's something like
kdaviss-MBP:DeepSpeech kdavis$ sox LibriSpeech/train-other-500/3547/8365/3547-8365-0023.flac 3547-8365-0023.wav
This should let us see what's going wrong. If you do this, can you add the results to the issue?
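If it is easier to do the same thing from Python, something along these lines would run the same sox command and capture whatever it prints to stderr. It is just a sketch: convert_by_hand is a made-up helper, and it assumes sox is invoked as sox <input> <output>, as in the command above.

```python
import shutil
import subprocess


def convert_by_hand(flac_file, wav_file, sox_bin="sox"):
    """Run `sox <flac_file> <wav_file>` and return (return_code, stderr_text).

    Returns (None, message) if the sox binary cannot be found on PATH.
    """
    if shutil.which(sox_bin) is None:
        return None, "could not find '%s' on PATH" % sox_bin
    result = subprocess.run(
        [sox_bin, flac_file, wav_file],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # A non-zero return code plus the stderr text is what is worth pasting here.
    return result.returncode, result.stderr
```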
I nuked the LibriSpeech directory last night, and then manually extracted the tarballs. The flac -> wav conversion took about 5 hours, but it did successfully complete.
During the conversion process, the only output is STARTING OPTIMIZATION, but I could confirm that the conversion was ongoing by querying the number of flac and wav files from a separate terminal, and watching the ratio shift. @kdavis-mozilla would you be averse to a PR with, e.g., a progress bar for librivox? I don't know how chatty this script ought to be.
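As a rough idea of what a not-too-chatty version could look like, here is a sketch that needs no extra packages. It is not the code from any eventual PR; report_progress and its every parameter are hypothetical.

```python
import sys


def report_progress(items, stream=sys.stderr, every=1000):
    """Yield each item, writing a progress line after every `every` items
    and once more after the final item."""
    items = list(items)  # materialize so the total is known up front
    total = len(items)
    for index, item in enumerate(items, start=1):
        yield item
        # Runs once the caller has finished with `item` and asks for the next.
        if index % every == 0 or index == total:
            stream.write("converted %d/%d files\n" % (index, total))
            stream.flush()
```

The importer's loop over flac files could then be wrapped as for flac_file in report_progress(flac_files): ..., where flac_files is a list collected beforehand.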
I believe I had gotten myself into a wedged state: on the attempt that led me to open this issue, I think the conversion script errored out or was killed in the middle of conversion and didn't resume properly. Starting from a state where no wav files exist seems to have resolved that issue.
@gvoysey Definitely not averse to a PR with a progress bar. We'd mentioned it internally before, but never found the time. I'm glad the issue got resolved.
The importers in general aren't very smart about resuming from a canceled preprocessing stage, but I'm not sure fixing that is worth the extra effort. I think a progress bar would be cool.
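For reference, if resuming ever does become worth it, one possible shape is to skip files whose wav already exists and to write each wav under a temporary name first. This is only a sketch under assumptions, not the importer's behavior: convert_missing is hypothetical, convert is a caller-supplied callable (for example a wrapper around the sox Transformer), and unlike the current code it does not delete the flac files.

```python
import fnmatch
import os


def convert_missing(source_dir, target_dir, convert):
    """Convert every flac under source_dir that has no wav in target_dir yet.

    `convert(flac_file, wav_file)` is supplied by the caller (hypothetical).
    Returns the number of files converted on this run.
    """
    os.makedirs(target_dir, exist_ok=True)
    converted = 0
    for root, _dirnames, filenames in os.walk(source_dir):
        for filename in fnmatch.filter(filenames, "*.flac"):
            flac_file = os.path.join(root, filename)
            stem = os.path.splitext(filename)[0]
            wav_file = os.path.join(target_dir, stem + ".wav")
            if os.path.exists(wav_file):
                continue  # finished on an earlier run
            # Write to a temporary name so an interrupted run never leaves a
            # half-written file under the final name. It still ends in .wav
            # so that tools which infer the format from the extension work.
            partial = os.path.join(target_dir, stem + ".partial.wav")
            convert(flac_file, partial)
            os.replace(partial, wav_file)
            converted += 1
    return converted
```

Because the existence check is per file rather than per directory, re-running after an interruption picks up where the previous run stopped.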
@reuben @kdavis-mozilla do you prefer a solution that requires no other external Python packages, or is touching requirements.txt OK?
@gvoysey Touching requirements.txt is fine with me.
Yep, fine with me too.
@gvoysey can this be closed now?
@reuben I think so. The fundamental issue was that the converters don't take kindly to getting interrupted as they are downloading the dataset, extracting it, or converting it. That issue's still true, but it's now more transparent to the user.
I've decided to give librivox a whirl by running ./bin/run-librivox.sh. The downloader code downloads files whose md5s match openSLR, but the importer fails soon thereafter:
Indeed, that wav file is not there. Its corresponding text file, /media/Data/Training/data/gvoysey-data/librivox/LibriSpeech/train-other-500-wav/3547-8365-0023.txt, does exist. Is this a librivox issue, or is the importer expecting data in a different format?
The only files that match that code are: