lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Re-dumped features with lilcom chunky writer... error on epoch 1 #569

Closed danpovey closed 2 years ago

danpovey commented 2 years ago

I got an error from the dataloader when running the librispeech-100h stateless_transducer recipe after re-dumping features with LilcomChunkWriter; this did not happen on the 1st epoch (epoch 0), only on epoch 1. The error seems to be caused by a Musan cut with a very short duration: search for duration=0.004 below. [Edit: I looked for this very short cut in the Musan features manifest but didn't find it; it seems to be generated somehow during data processing.]

The actual issue seems to occur at io.py:607:

        # Read, decode, concat
        arr = np.concatenate(
            [
                lilcom.decompress(data.tobytes())
                for data in self.hdf[key][left_chunk_idx:right_chunk_idx]
            ],
            axis=0,
        )

... I think the issue is that left_chunk_idx == right_chunk_idx, since there are zero frames in this particular cut (and perhaps because the offset falls on a multiple of chunk_size?), which causes np.concatenate to fail with ValueError: need at least one array to concatenate. There may be two problems here: (1) we probably shouldn't allow zero-frame cuts in the first place, and (2) this code isn't handling them too well. But I suspect we'd have problems anyway with zero-length cuts, because we're bound to run into a million special cases. We may need to fix this at the recipe level?
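
For illustration, here is a minimal sketch of the suspected failure mode; chunks below is a stand-in for the empty slice self.hdf[key][left_chunk_idx:right_chunk_idx], not the actual lhotse internals:

    import numpy as np

    # Stand-in for self.hdf[key][left_chunk_idx:right_chunk_idx] when
    # left_chunk_idx == right_chunk_idx: the slice is empty.
    chunks = []

    try:
        arr = np.concatenate(chunks, axis=0)
    except ValueError as e:
        print(e)  # "need at least one array to concatenate"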

(Pdb) ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/utils.py", line 630, in wrapper
    return fn(*args, **kwargs)
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/cut.py", line 941, in load_features
    feats = self.features.load(start=self.start, duration=self.duration)
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/features/base.py", line 474, in load
    return storage.read(
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/caching.py", line 70, in wrapper
    return m(*args, **kwargs)
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/features/io.py", line 607, in read
    arr = np.concatenate(
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/utils.py", line 630, in wrapper
    return fn(*args, **kwargs)
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/cut.py", line 2469, in load_features
    feats = track.cut.load_features()
  File "/ceph-dan/.local/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.825a884.clean-py3.8.egg/lhotse/utils.py", line 632, in wrapper
    raise type(e)(
ValueError: need at least one array to concatenate
[extra info] When calling: MonoCut.load_features(args=(MonoCut(id='83d5bb71-43aa-4428-9dd8-fa3a7cf32775', start=190.0, duration=0.004, channel=0, supervisions=[], features=Features(type='kaldi-fbank', num_frames=1000, num_features=80, frame_shift=0.01, sampling_rate=16000, start=190.0, duration=10.0, storage_type='chunked_lilcom_hdf5', storage_path='data/fbank/feats_musan/feats-8.h5', storage_key='9777610f-101a-4516-8ff7-40e42c9e500f', recording_id=None, channels=0), recording=Recording(id='speech-us-gov-0041', sources=[AudioSource(type='file', channels=[0], source='/ceph-dan/icefall/egs/librispeech/ASR/download/musan/speech/us-gov/speech-us-gov-0041.wav')], sampling_rate=16000, num_samples=9599687, duration=599.9804375, transforms=None), custom=None),) kwargs={})

During handling of the above exception, another exception occurred:

Also:

(Pdb) MonoCut(id='83d5bb71-43aa-4428-9dd8-fa3a7cf32775', start=190.0, duration=0.004, channel=0, supervisions=[], features=Features(type='kaldi-fbank', num_frames=1000, num_features=80, frame_shift=0.01, sampling_rate=16000, start=190.0, duration=10.0, storage_type='chunked_lilcom_hdf5', storage_path='data/fbank/feats_musan/feats-8.h5', storage_key='9777610f-101a-4516-8ff7-40e42c9e500f', recording_id=None, channels=0), recording=Recording(id='speech-us-gov-0041', sources=[AudioSource(type='file', channels=[0], source='/ceph-dan/icefall/egs/librispeech/ASR/download/musan/speech/us-gov/speech-us-gov-0041.wav')], sampling_rate=16000, num_samples=9599687, duration=599.9804375, transforms=None), custom=None).num_frames
0
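
(For context: with a frame_shift of 0.01 s, a 0.004 s cut spans less than one frame, so num_frames comes out as 0. A rough back-of-the-envelope check, not lhotse's exact frame-counting logic:)

    # Rough check, not lhotse's exact rounding rules:
    duration = 0.004    # seconds
    frame_shift = 0.01  # seconds
    print(int(duration / frame_shift))  # 0 -> zero frames to read
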
pzelasko commented 2 years ago

It should be ok on the recipe level, see: https://github.com/k2-fsa/icefall/blob/35ecd7e5629630242d28aa35004c8394ff7b1f91/egs/librispeech/ASR/local/compute_fbank_musan.py#L79

I think the issue is in CutSet.mix: it tries to mix noise into some very small remainder of the duration. I will look into it later.
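
For illustration, a hypothetical sketch of how such a tiny remainder can arise when a noise cut is tiled across a longer cut; the durations below are made up so the arithmetic reproduces the 0.004 s figure from the traceback, and this is not the actual CutSet.mix code:

    # Hypothetical numbers, not the actual CutSet.mix implementation:
    speech_duration = 10.0  # seconds: the cut being augmented
    noise_duration = 1.666  # seconds: the noise cut being mixed in

    # Noise is repeated until the speech is covered; the last repetition
    # covers only whatever little duration remains.
    n_full = int(speech_duration // noise_duration)        # 6 full repetitions
    remainder = speech_duration - n_full * noise_duration  # ~0.004 s
    print(round(remainder, 6))  # 0.004: shorter than one 10 ms frame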

danpovey commented 2 years ago

Thanks!

pzelasko commented 2 years ago

This should help https://github.com/lhotse-speech/lhotse/pull/570

csukuangfj commented 2 years ago

I have also seen this error before, though I haven't had time to look into it. The following is the error log: errors.txt


Using the chunked writer:

2022-01-26 07:54:38,547 INFO [train.py:508] (1/4) Epoch 0, batch 800, loss[loss=1.256, over 7273 frames.], tot_loss[loss=1.335, over 1434031.5878343845 frames.], batch size: 24
2022-01-26 07:54:38,547 INFO [train.py:508] (3/4) Epoch 0, batch 800, loss[loss=1.311, over 7275 frames.], tot_loss[loss=1.33, over 1434305.4018983867 frames.], batch size: 24
2022-01-26 07:54:38,548 INFO [train.py:508] (0/4) Epoch 0, batch 800, loss[loss=1.264, over 7298 frames.], tot_loss[loss=1.332, over 1434729.008382196 frames.], batch size: 24
2022-01-26 07:54:38,553 INFO [train.py:508] (2/4) Epoch 0, batch 800, loss[loss=1.232, over 7287 frames.], tot_loss[loss=1.333, over 1434293.2270573743 frames.], batch size: 24
2022-01-26 07:55:36,469 INFO [train.py:508] (3/4) Epoch 0, batch 850, loss[loss=1.118, over 7330 frames.], tot_loss[loss=1.296, over 1441180.145110653 frames.], batch size: 20
2022-01-26 07:55:36,469 INFO [train.py:508] (1/4) Epoch 0, batch 850, loss[loss=1.206, over 7334 frames.], tot_loss[loss=1.301, over 1440753.6948630$47 frames.], batch size: 20
2022-01-26 07:55:36,474 INFO [train.py:508] (2/4) Epoch 0, batch 850, loss[loss=1.22, over 7325 frames.], tot_loss[loss=1.297, over 1440973.488883005 frames.], batch size: 20
2022-01-26 07:55:36,484 INFO [train.py:508] (0/4) Epoch 0, batch 850, loss[loss=1.234, over 7319 frames.], tot_loss[loss=1.297, over 1441594.289510975 frames.], batch size: 20

Traceback (most recent call last):
  File "./transducer_stateless/train.py", line 750, in <module>
    main()
  File "./transducer_stateless/train.py", line 741, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/ceph-fj/fangjun/open-source-2/icefall-modified-attention/egs/librispeech/ASR/transducer_stateless/train.py", line 668, in run
    train_one_epoch(
  File "/ceph-fj/fangjun/open-source-2/icefall-modified-attention/egs/librispeech/ASR/transducer_stateless/train.py", line 485, in train_one_epoch
    for batch_idx, batch in enumerate(train_dl):
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 587, in wrapper
    return fn(*args, **kwargs)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/cut.py", line 931, in load_features
    feats = self.features.load(start=self.start, duration=self.duration)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/features/base.py", line 458, in load
    return storage.read(
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/caching.py", line 70, in wrapper
    return m(*args, **kwargs)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/features/io.py", line 591, in read
    arr = np.concatenate(
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 587, in wrapper
    return fn(*args, **kwargs)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/cut.py", line 2306, in load_features
    feats=track.cut.load_features(),
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 589, in wrapper
    raise type(e)(
ValueError: need at least one array to concatenate
[extra info] When calling: MonoCut.load_features(args=(MonoCut(id='b7f40908-1e74-41f1-af8b-f28805e19c73', start=140.0, duration=0.004625, channel=0, supervisions=[], features=Features(type='kaldi-fbank', num_frames=1000, num_features=80, frame_shift=0.01, sampling_rate=16000, start=140.0, duration=10.0, storage_type='chunked_lilcom_hdf5', storage_path='data/fbank/feats_musan/feats-12.h5', storage_key='8dc20df8-0173-44c9-b945-c87540e36af7', recording_id=None, channels=0), recording=Recording(id='music-rfm-0045', sources=[AudioSource(type='file', channels=[0], source='/ceph-fj/fangjun/open-source-2/icefall-master/egs/librispeech/ASR/download/musan/music/rfm/music-rfm-0045.wav')], sampling_rate=16000, num_samples=2908996, duration=181.81225, transforms=None), custom=None),) kwargs={})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = self.dataset[possibly_batched_index]
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/speech_recognition.py", line 105, in __getitem__
    inputs, _ = self.input_strategy(cuts)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/input_strategies.py", line 110, in __call__
    return collate_features(cuts, executor=_get_executor(self.num_workers))
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/collation.py", line 136, in collate_features
    features[idx] = _read_features(cut)
  File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/collation.py", line 416, in _read_features
    return torch.from_numpy(cut.load_features())

pzelasko commented 2 years ago

I fixed the addition of too-short cuts in the mix method, and also fixed the handling of empty arrays in various places in the codebase.
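
For readers on older versions, a minimal sketch of the kind of empty-array guard described above (hypothetical; see PR #570 for the actual changes):

    import numpy as np

    def concat_chunks(chunks, num_features=80):
        # Hypothetical guard, not the actual lhotse fix: return an empty
        # array of the right shape instead of letting np.concatenate
        # raise on an empty list.
        if not chunks:
            return np.empty((0, num_features), dtype=np.float32)
        return np.concatenate(chunks, axis=0)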