Lhotse processing error in librispeech recipe

ngoel17 commented 1 year ago

We noticed this error while running icefall training on another dataset. After a fresh install, we ran the librispeech recipe and reproduced the same error, which seems to be triggered by Lhotse's handling of the data. The log file is attached: libri.log

pzelasko commented 1 year ago

These lines are the key:

  File "/mnt/dsk1/22feb/lhotse/lhotse/features/io.py", line 765, in <listcomp>
    decompressed_chunks = [lilcom.decompress(data) for data in chunk_data]
  File "~/anaconda3/envs/k2_feb23/lib/python3.9/site-packages/lilcom/lilcom_interface.py", line 110, in decompress
    raise ValueError("Something went wrong in decompression (likely bad data): "
ValueError: Something went wrong in decompression (likely bad data): decompress_float returned 7

I think you may have corrupted data. Did all the feature extraction jobs / scripts complete successfully?

ngoel17 commented 1 year ago

Yes. The feature extraction scripts ran to completion and did not throw any errors. However, we get exactly the same messages on two other datasets as well.

pzelasko commented 1 year ago

Hmmm, I am not sure what happened then. Here are a few long shots; maybe one of them will work:

- Did you upgrade lhotse or lilcom recently? Can you try with an older version?
- Which type of feature storage did you use? HDF5, chunky, or something else?
- Can you identify what % of your data is affected by this issue? E.g. try iterating over a random 1000 cuts and loading the features in a try/except block. If it's just a few outliers, maybe you can remove them (I could add a fault_tolerant feature loading mode for these cases).
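
The last suggestion could be sketched roughly as follows. This is a minimal sketch, not an existing lhotse function: `find_unreadable_cuts` is a hypothetical helper, and it only assumes that each cut exposes `.id` and `.load_features()`, as lhotse cuts do.

```python
import random


def find_unreadable_cuts(cuts, sample_size=1000, seed=0):
    """Try to load features for a random sample of cuts.

    `cuts` is any iterable of objects exposing `.id` and `.load_features()`.
    Returns (failed, fraction), where `failed` is a list of
    (cut_id, error_repr) pairs and `fraction` is the share of sampled
    cuts whose features could not be loaded.
    """
    cuts = list(cuts)  # materializes the manifest; fine for a one-off check
    rng = random.Random(seed)
    sample = rng.sample(cuts, min(sample_size, len(cuts)))

    failed = []
    for cut in sample:
        try:
            cut.load_features()
        except Exception as exc:  # e.g. the ValueError raised by lilcom above
            failed.append((cut.id, repr(exc)))

    fraction = len(failed) / len(sample) if sample else 0.0
    return failed, fraction
```

With lhotse this might be called as `failed, fraction = find_unreadable_cuts(load_manifest_lazy(path))`, where `path` points at one of the cut manifests written by the recipe (the exact file name is an assumption here and depends on your setup). If only a handful of cuts fail, their ids can be dropped with `cuts.filter(lambda c: c.id not in bad_ids)`; if most of them fail, the problem is more likely in the storage or the environment than in individual files.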

ngoel17 commented 1 year ago

Yeah. Do you have a preference for a decompression method? There is also this environment variable regarding protobuf that helps some people but probably hurt us.

I will try to see if I can find more pointers on the three suggestions. As far as I know, there were no updates applied to one system but not another. At the moment we are not 100% sure whether the problem is bad data written at feature extraction time or a problem at load time, and whether it's really in the data or in the code.

csukuangfj commented 1 year ago

By the way, did you restart the feature extraction at some point because of some error?

s-mousmita commented 1 year ago

> By the way, did you restart the feature extraction at some point because of some error?

We didn't. We ran librispeech/ASR/prepare.sh without any modification and it completed all stages in one go.

k2-fsa / icefall

Lhotse processing error in librispeech recipe #918