Closed: danpovey closed this issue 2 years ago
It should be ok on the recipe level, see: https://github.com/k2-fsa/icefall/blob/35ecd7e5629630242d28aa35004c8394ff7b1f91/egs/librispeech/ASR/local/compute_fbank_musan.py#L79
I think the issue is in CutSet.mix: it tries to mix noise into some very small remainder of the duration. I will look into it later.
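A hedged sketch of the kind of guard this suggests (the function name and threshold are illustrative, not the actual lhotse API): remainders shorter than one feature frame cannot produce any frames, so mixing noise onto them could simply be skipped.

```python
FRAME_SHIFT = 0.01  # seconds per fbank frame, matching the manifests in this thread


def should_mix_noise(remainder_duration: float, min_duration: float = FRAME_SHIFT) -> bool:
    """Return False for remainders too short to hold even one feature frame."""
    return remainder_duration >= min_duration


# The 0.004625 s cut seen in the traceback in this thread yields zero frames,
# so mixing noise onto it would be skipped:
print(should_mix_noise(0.004625))  # False
print(should_mix_noise(1.5))       # True
```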
Thanks!
This should help https://github.com/lhotse-speech/lhotse/pull/570
I have also seen this error before, though I don't have time to look into it. Here is the error log: errors.txt
using chunked writer
2022-01-26 07:54:38,547 INFO [train.py:508] (1/4) Epoch 0, batch 800, loss[loss=1.256, over 7273 frames.], tot_loss[loss=1.335, over 1434031.5878343845 frames.], batch size: 24
2022-01-26 07:54:38,547 INFO [train.py:508] (3/4) Epoch 0, batch 800, loss[loss=1.311, over 7275 frames.], tot_loss[loss=1.33, over 1434305.4018983867 frames.], batch size: 24
2022-01-26 07:54:38,548 INFO [train.py:508] (0/4) Epoch 0, batch 800, loss[loss=1.264, over 7298 frames.], tot_loss[loss=1.332, over 1434729.008382196 frames.], batch size: 24
2022-01-26 07:54:38,553 INFO [train.py:508] (2/4) Epoch 0, batch 800, loss[loss=1.232, over 7287 frames.], tot_loss[loss=1.333, over 1434293.2270573743 frames.], batch size: 24
2022-01-26 07:55:36,469 INFO [train.py:508] (3/4) Epoch 0, batch 850, loss[loss=1.118, over 7330 frames.], tot_loss[loss=1.296, over 1441180.145110653 frames.], batch size: 20
2022-01-26 07:55:36,469 INFO [train.py:508] (1/4) Epoch 0, batch 850, loss[loss=1.206, over 7334 frames.], tot_loss[loss=1.301, over 1440753.6948630$47 frames.], batch size: 20
2022-01-26 07:55:36,474 INFO [train.py:508] (2/4) Epoch 0, batch 850, loss[loss=1.22, over 7325 frames.], tot_loss[loss=1.297, over 1440973.488883005 frames.], batch size: 20
2022-01-26 07:55:36,484 INFO [train.py:508] (0/4) Epoch 0, batch 850, loss[loss=1.234, over 7319 frames.], tot_loss[loss=1.297, over 1441594.289510975 frames.], batch size: 20
Traceback (most recent call last):
File "./transducer_stateless/train.py", line 750, in <module>
main()
File "./transducer_stateless/train.py", line 741, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/ceph-fj/fangjun/open-source-2/icefall-modified-attention/egs/librispeech/ASR/transducer_stateless/train.py", line 668, in run
train_one_epoch(
File "/ceph-fj/fangjun/open-source-2/icefall-modified-attention/egs/librispeech/ASR/transducer_stateless/train.py", line 485, in train_one_epoch
for batch_idx, batch in enumerate(train_dl):
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 587, in wrapper
return fn(*args, **kwargs)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/cut.py", line 931, in load_features
feats = self.features.load(start=self.start, duration=self.duration)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/features/base.py", line 458, in load
return storage.read(
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/caching.py", line 70, in wrapper
return m(*args, **kwargs)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/features/io.py", line 591, in read
arr = np.concatenate(
File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 587, in wrapper
return fn(*args, **kwargs)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/cut.py", line 2306, in load_features
feats=track.cut.load_features(),
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/utils.py", line 589, in wrapper
raise type(e)(
ValueError: need at least one array to concatenate
[extra info] When calling: MonoCut.load_features(args=(MonoCut(id='b7f40908-1e74-41f1-af8b-f28805e19c73', start=140.0, duration=0.004625, channel=0,supervisions=[], features=Features(type='kaldi-fbank', num_frames=1000, num_features=80, frame_shift=0.01, sampling_rate=16000, start=140.0, duration=10.0, storage_type='chunked_lilcom_hdf5', storage_path='data/fbank/feats_musan/feats-12.h5', storage_key='8dc20df8-0173-44c9-b945-c87540e36af7', recording_id=None, channels=0), recording=Recording(id='music-rfm-0045', sources=[AudioSource(type='file', channels=[0], source='/ceph-fj/fangjun/open-source-2/icefall-master/egs/librispeech/ASR/download/musan/music/rfm/music-rfm-0045.wav')], sampling_rate=16000, num_samples=2908996, duration=181.81225, transforms=None), custom=None),) kwargs={})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/ceph-fj/fangjun/py38-1.10/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = self.dataset[possibly_batched_index]
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/speech_recognition.py", line 105, in __getitem__
inputs, _ = self.input_strategy(cuts)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/input_strategies.py", line 110, in __call__
return collate_features(cuts, executor=_get_executor(self.num_workers))
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/collation.py", line 136, in collate_features
features[idx] = _read_features(cut)
File "/ceph-fj/fangjun/open-source-2/lhotse-rnnt/lhotse/dataset/collation.py", line 416, in _read_features
return torch.from_numpy(cut.load_features())
I fixed the addition of too-short cuts in the mix method, and also fixed the handling of empty arrays in various places in the codebase.
I got an error from the dataloader when running the librispeech-100h stateless_transducer recipe after dumping features with LilcomChunkWriter, but it did not happen until after the 1st epoch. The error seems to be caused by a Musan cut with a very short duration: search for duration=0.004 below. [Edit: I looked for this very short cut in the Musan features manifest but didn't find it; it seems to be generated somewhere during data processing.]
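The offending MonoCut in the [extra info] line has duration=0.004625 while the Features manifest uses frame_shift=0.01, so the cut spans less than one frame. A quick check of that arithmetic (the rounding convention here is an assumption; any reasonable convention gives zero frames for a cut shorter than one frame shift):

```python
import math

frame_shift = 0.01   # seconds, from the Features manifest in the traceback
duration = 0.004625  # seconds, from the offending MonoCut

# Roughly duration / frame_shift, rounded down: a cut shorter than one
# frame shift covers zero feature frames.
num_frames = math.floor(duration / frame_shift)
print(num_frames)  # 0
```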
The actual issue seems to occur at io.py:607:
... I think the issue is that left_chunk_idx == right_chunk_idx, since there are zero frames in this particular cut (and perhaps because the offset falls on a multiple of chunk_size?), which causes np.concatenate to fail with:
ValueError: need at least one array to concatenate
There may be two problems here: (1) we probably shouldn't allow zero-frame cuts in the first place, and (2) this code perhaps isn't handling them too well. But I suspect we'd have problems anyway with zero-length cuts, because we're bound to run into a million special cases. We may need to fix this at the recipe level?
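For reference, np.concatenate raises exactly this error when given an empty sequence, which is what happens when the chunk-index range selects no chunks. A minimal reproduction plus one possible guard (the early-return shape below is an assumption for illustration, not the actual fix that landed in lhotse):

```python
import numpy as np

def read_chunks(chunks, num_features=80):
    """Concatenate feature chunks, returning an empty array when none are selected."""
    if len(chunks) == 0:
        # Guard: np.concatenate([]) raises
        # "ValueError: need at least one array to concatenate".
        return np.empty((0, num_features), dtype=np.float32)
    return np.concatenate(chunks, axis=0)

# Zero chunks selected (left_chunk_idx == right_chunk_idx):
print(read_chunks([]).shape)  # (0, 80)

# The unguarded call reproduces the error from the log:
try:
    np.concatenate([])
except ValueError as e:
    print(e)  # need at least one array to concatenate
```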