mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

librivox training running into file not found errors #445

Closed gvoysey closed 7 years ago

gvoysey commented 7 years ago

I've decided to give librivox a whirl by running ./bin/run-librivox.sh.

The downloader code downloads files whose md5s match openSLR, but the importer fails soon thereafter:

Traceback (most recent call last):
  File "DeepSpeech.py", line 1138, in <module>
    last_train_wer, last_dev_wer, hibernation_path = train()
  File "DeepSpeech.py", line 1037, in train
    train_context = create_execution_context('train')
  File "DeepSpeech.py", line 779, in create_execution_context
    data_set = read_data_set(set_name)
  File "DeepSpeech.py", line 742, in read_data_set
    data_sets = read_data_sets([set_name])
  File "DeepSpeech.py", line 733, in read_data_sets
    sets=set_names)
  File "/home/gvoysey/gvoysey-sandbox/DeepSpeech/util/importers/librivox.py", line 200, in read_data_sets
    train = _read_data_set(work_dir, "train-*-wav", thread_count, train_batch_size, numcep, numcontext, limit=limit_train)
  File "/home/gvoysey/gvoysey-sandbox/DeepSpeech/util/importers/librivox.py", line 278, in _read_data_set
    return DataSet(txt_files, thread_count, batch_size, numcep, numcontext)
  File "/home/gvoysey/gvoysey-sandbox/DeepSpeech/util/importers/librivox.py", line 63, in __init__
    self._files_circular_list = self._create_files_circular_list()
  File "/home/gvoysey/gvoysey-sandbox/DeepSpeech/util/importers/librivox.py", line 84, in _create_files_circular_list
    wav_file_size = os.path.getsize(wav_file)
  File "/home/gvoysey/gvoysey-sandbox/venvs/deepspeech/lib/python2.7/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/media/Data/Training/data/gvoysey-data/librivox/LibriSpeech/train-other-500-wav/3547-8365-0023.wav'

Indeed, that wav file is not there. Its corresponding text file, /media/Data/Training/data/gvoysey-data/librivox/LibriSpeech/train-other-500-wav/3547-8365-0023.txt, does exist.

Is this a librivox issue, or is the importer expecting data in a different format?

The only files that match that code are:

$ find . -name "3547-8365-0023*"
./LibriSpeech/train-other-500/3547/8365/3547-8365-0023.flac
./LibriSpeech/train-other-500-wav/3547-8365-0023.txt

kdavis-mozilla commented 7 years ago

@gvoysey The importer expects a flac file and then converts the flac file to wav.

The conversion happens here: librivox.py#L223

def _maybe_convert_wav(data_dir, extracted_data, converted_data):
    source_dir = os.path.join(data_dir, extracted_data)
    target_dir = os.path.join(data_dir, converted_data)

    # Conditionally convert FLAC files to wav files
    if not gfile.Exists(target_dir):
        # Create target_dir
        os.makedirs(target_dir)

        # Loop over FLAC files in source_dir and convert each to wav
        for root, dirnames, filenames in os.walk(source_dir):
            for filename in fnmatch.filter(filenames, '*.flac'):
                flac_file = os.path.join(root, filename)
                wav_filename = os.path.splitext(os.path.basename(flac_file))[0] + ".wav"
                wav_file = os.path.join(target_dir, wav_filename)
                transformer = Transformer()
                transformer.build(flac_file, wav_file)
                os.remove(flac_file)

My follow-on question then is:

Are any flac files converted to wav?

I'm wondering if the sox module has the correct supporting libraries on your system. In theory such support should be checked here: librivox.py#L123. But we may be seeing some strange failure mode on your system that we haven't encountered before.

gvoysey commented 7 years ago

@kdavis-mozilla

I'm running SoX version 14.14.1.

sox --help lists:

AUDIO FILE FORMATS: 8svx aif aifc aiff aiffc al amb amr-nb amr-wb anb au avr awb caf cdda cdr cvs cvsd cvu dat dvms f32 f4 f64 f8 fap flac fssd gsm gsrt hcom htk ima ircam la lpc lpc10 lu mat mat4 mat5 maud nist ogg paf prc pvf raw s1 s16 s2 s24 s3 s32 s4 s8 sb sd2 sds sf sl sln smp snd sndfile sndr sndt sou sox sph sw txw u1 u16 u2 u24 u3 u32 u4 u8 ub ul uw vms voc vorbis vox w64 wav wavpcm wv wve xa xi
PLAYLIST FORMATS: m3u pls
AUDIO DEVICE DRIVERS: alsa

From the root LibriSpeech directory, find . -name "*.wav" | wc -l returns 236,644 wav files. A similar search for *.flac returns 148,688 files. I don't quite have the bash chops to compare file name chunks to see if they're disjoint sets, though, so I don't know whether this tells you that some but not all of the conversion has happened.

I have the original tarballs, so I suppose I could nuke LibriSpeech and untar everything myself to start over. Let me know if you think that's something worth trying.
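
The comparison mentioned above, checking whether the flac and wav file names are disjoint sets, could be sketched in Python rather than bash. This is only a rough sketch: the function names are hypothetical and not part of DeepSpeech, and it assumes that a flac file and its wav counterpart share the same base name, as in the find output earlier in this thread.

```python
import os


def stems_by_extension(root_dir, extension):
    """Collect base names (without extension) of every file under root_dir
    whose extension matches, e.g. ".flac" or ".wav"."""
    stems = set()
    for _dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            stem, ext = os.path.splitext(filename)
            if ext.lower() == extension:
                stems.add(stem)
    return stems


def compare_conversion_state(root_dir):
    """Return three sets of base names: (wav only, flac only, both).

    Hypothetical helper for diagnosis, not part of the importer.
    """
    flac = stems_by_extension(root_dir, ".flac")
    wav = stems_by_extension(root_dir, ".wav")
    return wav - flac, flac - wav, flac & wav
```

Since the importer deletes each flac file after converting it, names that appear only as wav were presumably converted, names that appear only as flac were never converted, and names that appear as both may point to a conversion that was interrupted partway through.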

kdavis-mozilla commented 7 years ago

The plot thickens.

librivox initially only contains flac files so any wav files are the result of a sox conversion. Thus conversion is happening. But for some reason it just stops.

I guess there are at least two options:

To just try and get things running, I'd first try to re-run the code without nuking LibriSpeech. If that doesn't work, I'd try to nuke LibriSpeech, then run again.

Fixing this bug requires a bit more detective work. I'd suggest trying to convert the problematic flac file to wav "by hand", i.e. on the command line. I think it's something like

kdaviss-MBP:DeepSpeech kdavis$ sox LibriSpeech/train-other-500/3547/8365/3547-8365-0023.flac 3547-8365-0023.wav

This should let us see what's going wrong. If you do this, can you add the results to the issue?

gvoysey commented 7 years ago

I nuked the LibriSpeech directory last night, and then manually extracted the tarballs. The flac -> wav conversion took about 5 hours, but it did successfully complete.

During the conversion process, the only output is STARTING OPTIMIZATION, but I could confirm that the conversion was ongoing by querying the number of flac and wav files from a separate terminal, and watching the ratio shift. @kdavis-mozilla would you be averse to a PR with, e.g., a progress bar for librivox? I don't know how chatty this script ought to be.

I believe I had gotten myself into a wedged state because, on the attempt that led me to open this issue, I think the conversion script errored out or was killed in the middle of conversion and didn't resume properly. Creating a state where no wav files exist seems to have resolved that issue.

kdavis-mozilla commented 7 years ago

@gvoysey Definitely not averse to a PR with a progress bar. We'd mentioned it internally before, but never got the time. I'm glad the issue got resolved.
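
For illustration, a progress bar of the kind discussed here could be sketched with the standard library alone. This is not the code from any actual PR; `render_progress` and `with_progress` are hypothetical names, and the sketch assumes Python 3.

```python
import sys


def render_progress(done, total, width=40):
    """Return a one-line text progress bar for `done` out of `total` items."""
    fraction = done / total if total else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)
    return "[{}] {}/{} ({:.0f}%)".format(bar, done, total, fraction * 100)


def with_progress(items, stream=sys.stderr):
    """Yield each item, redrawing the progress line after it is processed."""
    items = list(items)  # the total must be known up front
    total = len(items)
    for index, item in enumerate(items, start=1):
        yield item
        # The carriage return moves the cursor back so the line is overwritten.
        stream.write("\r" + render_progress(index, total))
        stream.flush()
    stream.write("\n")
```

Wrapping the list of flac files in `with_progress(...)` inside the conversion loop would give the user a running count without needing a second terminal.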

reuben commented 7 years ago

The importers in general aren't very smart about resuming from a canceled preprocessing stage, but I'm not sure fixing that is worth the extra effort. I think a progress bar would be cool.
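
As a rough sketch of what resuming could look like (not what the importers do, and not a proposal anyone in this thread committed to): skip files whose wav already exists, and write each conversion to a temporary name that is renamed only on success, so an interrupted run never leaves a half-written wav under its final name. The function name and the `convert` callback are hypothetical, and Python 3 is assumed.

```python
import os


def convert_resumably(flac_files, target_dir, convert):
    """Convert each FLAC file to WAV, skipping files finished by an earlier run.

    `convert(src, dst)` is a caller-supplied function (hypothetical here)
    that writes the WAV data for `src` to the path `dst`.
    Returns the list of WAV paths written by this call.
    """
    os.makedirs(target_dir, exist_ok=True)
    written = []
    for flac_file in flac_files:
        stem = os.path.splitext(os.path.basename(flac_file))[0]
        wav_file = os.path.join(target_dir, stem + ".wav")
        if os.path.exists(wav_file):
            continue  # already converted in an earlier run
        # Keep a .wav suffix on the temporary name, since converters may
        # infer the output format from the file extension.
        partial = os.path.join(target_dir, stem + ".partial.wav")
        convert(flac_file, partial)
        os.replace(partial, wav_file)  # atomic when on the same filesystem
        written.append(wav_file)
    return written
```

Unlike the existing importer code, this sketch does not delete the source flac file, and leftover `.partial.wav` files from a killed run are simply overwritten on the next attempt.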

gvoysey commented 7 years ago

@reuben @kdavis-mozilla do you prefer a solution that requires no other external python packages, or is touching requirements.txt OK?

kdavis-mozilla commented 7 years ago

@gvoysey Touching requirements.txt is fine with me

reuben commented 7 years ago

Yep, fine with me too.

reuben commented 7 years ago

@gvoysey can this be closed now?

gvoysey commented 7 years ago

@reuben I think so. The fundamental issue was that the converters don't take kindly to getting interrupted as they are downloading the dataset, extracting it, or converting it. That issue's still true, but it's now more transparent to the user.
