lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

How much shared memory and disk space do I need to process the S subset of the WenetSpeech dataset? #1132

Open SaltedSlark opened 1 year ago

SaltedSlark commented 1 year ago

Insufficient shm? (screenshot) Insufficient disk memory? (screenshot) Here is my docker info: (screenshot)

pzelasko commented 1 year ago

I’m guessing this is related to the IPC of the data loading workers used for batch feature computation, and could be caused by too many workers or too-large batches. But judging by the warning about max_duration: did you trim your cut set to supervisions? Can you show the output of `lhotse cut describe cuts.jsonl.gz`? I think you might be computing features for very long cuts (and you probably don’t need to).

SaltedSlark commented 1 year ago

> I’m guessing this is related to the IPC of the data loading workers […] did you trim your cut set to supervisions?

Thanks for your reply! I changed num_workers to 0, and this happened:

/bin/bash: /home/zj/anaconda3/envs/vall-e/lib/libtinfo.so.6: no version information available (required by /bin/bash)
2023-08-30 10:26:50 (prepare.sh:59:main) Stage 1: Prepare wenetspeech manifest
2023-08-30 10:26:50 (prepare.sh:71:main) Stage 2: Tokenize/Fbank wenetspeech
2023-08-30 10:27:06,501 INFO [tokenizer.py:160] dataset_parts: ['S'] manifests {'S': {'recordings': RecordingSet(len=43664), 'supervisions': SupervisionSet(len=151600)}}
2023-08-30 10:27:06,507 INFO [tokenizer.py:167] Processing partition: S CUDA: True
Computing features in batches:   0%|                                                      | 0/43664 [00:00<?, ?it/s]/home/zj/workspace/TTS/lhotse/lhotse/dataset/sampling/simple.py:216: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
  warnings.warn(
Computing features in batches:   0%|                                                      | 0/43664 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 268, in <module>
    main()
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 204, in main
    cut_set = cut_set.compute_and_store_features_batch(
  File "/home/zj/workspace/TTS/lhotse/lhotse/cut/set.py", line 2308, in compute_and_store_features_batch
    features = extractor.extract_batch(
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 348, in extract_batch
    encoded_frames = self.tokenizer.encode(samples.detach().to(device))
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 239, in encode
    return self.codec.encode(wav.to(self.device))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 63, in forward
    return self.shortcut(x) + self.block(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 204, in forward
    x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 92, in pad1d
    padded = F.pad(x, paddings, mode, value)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.14 GiB (GPU 0; 23.65 GiB total capacity; 21.73 GiB already allocated; 104.06 MiB free; 21.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does this mean that the RecordingSet or SupervisionSet contains audio that is too long for my GPU (RTX 4090, 24 GB)? What should I do to avoid this?

pzelasko commented 1 year ago

Try `cuts = cuts.trim_to_supervisions()` before feature extraction, and then you can also use multiple workers again.
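
A minimal sketch of that step, assuming the cuts live in a file called cuts.jsonl.gz (both paths below are placeholders):

```python
from lhotse import CutSet

# Load the manifest with the long, untrimmed cuts.
cuts = CutSet.from_file("cuts.jsonl.gz")  # placeholder path

# Cut out each supervision segment, so feature extraction runs on
# utterance-sized chunks instead of whole recordings.
cuts = cuts.trim_to_supervisions()

cuts.to_file("cuts_trimmed.jsonl.gz")  # placeholder path
```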

SaltedSlark commented 1 year ago

> Try `cuts = cuts.trim_to_supervisions()` before feature extraction, and then you can also use multiple workers again.

Thanks! Like this? Before: (screenshot) After: (screenshot)

pzelasko commented 1 year ago

Yeah

SaltedSlark commented 1 year ago

> Yeah

Thanks! I ran into another problem when trying to train my VALL-E model on the S subset: (screenshot). I have no idea what is wrong; looking forward to your reply, much love!

pzelasko commented 1 year ago

Looks like not every training example has features extracted. Make sure you passed the path to the right cut set (the one with features). You can also run `lhotse cut describe <cuts file>`; it will show you some stats about the data.
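
For reference, the same stats can be printed from Python; a quick sketch (the path is a placeholder):

```python
from lhotse import CutSet

cuts = CutSet.from_file("cut_train.jsonl.gz")  # placeholder path
cuts.describe()  # prints cut counts, duration stats, and feature/supervision availability
```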

SaltedSlark commented 1 year ago

> Looks like not every training example has features extracted. Make sure you passed the path to the right cut set (with features). […]

Okay, here are the stats of my cut_train.jsonl.gz: (screenshot). It looks like the number of cuts with features is much smaller than the cut count. Is something wrong, and why did it happen?

SaltedSlark commented 1 year ago

> […] It looks like the number of cuts with features is much smaller than the cut count. Is something wrong, and why did it happen?

I combined two sets to get the cut_train set and found that one of them has 0 features... (screenshot)

danpovey commented 1 year ago

Silence is over 90%??


SaltedSlark commented 1 year ago

> Silence is over 90%??

It looks so weird... and I don't know what's wrong.

danpovey commented 1 year ago

Look at the jsonl file


pzelasko commented 1 year ago

> It looks like the number of cuts with features is much smaller than the cut count. Is something wrong, and why did it happen? I combined two sets to get the cut_train set and found one of them has 0 features...

Perhaps one of the cut sets you combined did not have features computed. Also, judging by the mean duration of 1600 s, you did not call `.trim_to_supervisions()` on this cut set.

SaltedSlark commented 1 year ago

Thank you so much @pzelasko @danpovey! I'll try.

SaltedSlark commented 1 year ago

@pzelasko As for the M subset, I am sure that I called `.trim_to_supervisions()`, as I showed. I found that the number of supervisions available does not match the number of features available (screenshot), and it seems to cause a validation error after calling validate(): (screenshot)

Jiang-Stan commented 1 year ago

> @pzelasko As for the M subset, I am sure that I called `.trim_to_supervisions()` […] I found that the number of supervisions available does not match the number of features available, and it seems to cause a validation error after calling validate().

(screenshot) The detailed description in this function mentions that `keep_overlapping` is what keeps those numbers matched.

Result on the S subset: (screenshot)

pzelasko commented 1 year ago

You either need to use `keep_overlapping=False` or filter out the cuts that have overlapping speech (whichever makes sense for your use case).
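
A minimal sketch of both options, assuming `cuts` is the trimmed cut set (the single-supervision filter is just one possible notion of "non-overlapping"):

```python
# Option 1: drop supervisions that overlap the current one while trimming.
cuts = cuts.trim_to_supervisions(keep_overlapping=False)

# Option 2: keep the default trim and filter out cuts that ended up with
# more than one supervision (i.e., overlapping speech).
cuts = cuts.filter(lambda cut: len(cut.supervisions) == 1)
```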

Jiang-Stan commented 1 year ago

@SaltedSlark Hi, how long did preprocessing the WenetSpeech M set take you? Feature extraction takes me 50 minutes, but saving to wenetspeech_cuts_M.jsonl.gz has taken over 11 hours and still hasn't finished.

@pzelasko Is there any parallelization optimization for this function? I tried to preprocess the WenetSpeech M set last night, and it spent over 11 hours in this function without finishing (the progress bar showed 50 minutes of elapsed time before the keyboard interrupt). I have successfully preprocessed the WenetSpeech S set twice with the same num_workers, and there the saving time was negligible, so I don't think this is a lock issue. (screenshot) Using htop, I found that only one CPU is being used for saving. (screenshot)

SaltedSlark commented 1 year ago

> @SaltedSlark Hi, how long did preprocessing the WenetSpeech M set take you? It takes me 50 minutes to extract features, but saving to wenetspeech_cuts_M.jsonl.gz has taken over 8 hours and still hasn't finished. […]

For me, it took about 80 hours to process the M subset... and I also want to know how to speed it up!

SaltedSlark commented 1 year ago


I'll try again.

Jiang-Stan commented 1 year ago

I noticed that only one thread is set to save data here. I tried using 32 threads, but it still could not finish saving. @pzelasko

By splitting the recordings and supervisions in the manifest into smaller sets, I successfully generated wenetspeech_cuts_M_{i}.jsonl.gz (i = 0..9) within an hour. Since recordings and supervisions are saved sequentially, it doesn't take long to match them. @SaltedSlark

SaltedSlark commented 1 year ago

> By splitting the recordings and supervisions in the manifest into smaller sets, I successfully generated wenetspeech_cuts_M_{i}.jsonl.gz (i = 0..9) within an hour. […]

Thanks! But I don't know how to separate the recordings and supervisions in the manifest; I need your help, bro.

Jiang-Stan commented 1 year ago

> Thanks! But I don't know how to separate the recordings and supervisions in the manifest; I need your help, bro.

```python
from tqdm import tqdm

from lhotse.audio import RecordingSet
from lhotse.supervision import SupervisionSet
from lhotse.recipes.utils import read_manifests_if_cached

# `dataset_parts` and `args` come from the surrounding script (argparse).
manifests = read_manifests_if_cached(
    dataset_parts=dataset_parts,
    output_dir=args.src_dir,
    prefix=args.prefix,
    suffix=args.suffix,
    types=["recordings", "supervisions", "cuts"],
)

if args.prefix == "wenetspeech" and ("M" in manifests or "L" in manifests):
    # Split the big subset into 10 (M) or 100 (L) smaller parts, so each
    # part can be processed and saved independently.
    separate_num = 10 if "M" in manifests else 100
    name = "M" if "M" in manifests else "L"
    origin_manifest = manifests.pop(name)
    recordings = list(origin_manifest["recordings"])
    supervisions = list(origin_manifest["supervisions"])
    start_idx = 0
    for i in tqdm(range(separate_num)):
        subset_name = name + str(i)
        end_idx = len(recordings) * (i + 1) // separate_num
        cur_recordings = recordings[start_idx:end_idx]
        # Supervisions are stored in the same order as recordings, so we can
        # consume them sequentially while they match the current recording.
        cur_supervisions = []
        for r in cur_recordings:
            while supervisions and supervisions[0].recording_id == r.id:
                cur_supervisions.append(supervisions.pop(0))
        manifests[subset_name] = {
            "recordings": RecordingSet.from_recordings(cur_recordings),
            "supervisions": SupervisionSet.from_segments(cur_supervisions),
        }
        start_idx = end_idx
    # Every supervision should have been assigned to some part.
    assert len(supervisions) == 0
```

pzelasko commented 1 year ago

Some tips:

- you can split a cut set into parts with `cuts.split`:

```
In [8]: cuts.split(2)
Out[8]:
[CutSet(len=760) [underlying data type: <class 'dict'>],
 CutSet(len=759) [underlying data type: <class 'dict'>]]
```

- `cuts.compute_and_store_features_batch` is bottlenecked by I/O in 99% of use cases, since feature extraction is usually much quicker than dataloading. Try to set the highest possible `batch_duration` first, and then keep increasing `num_workers` until you start seeing crashes, freezes, or slowdowns.
- if you're computing features on CPU or have multiple GPUs, it's generally a good idea to split a single large cut set into parts (as suggested earlier) and run multiple scripts that process these parts in parallel; see the sketch after this list. For CPU-based computation, prefer `compute_and_store_features`, as it supports built-in parallelization across CPUs (unlike the batch version).
OswaldoBornemann commented 7 months ago

So is it possible to compute features on the fly in the function compute_and_store_features_batch?

pzelasko commented 7 months ago

I didn’t get your question, please elaborate.

OswaldoBornemann commented 7 months ago

Sorry for my incomplete question. What I'm asking is whether we can calculate the features on the fly during training, instead of storing them? Because in my case, I don't have such a large GPU for training.

pzelasko commented 7 months ago

Yes, you can compute the features inside the PyTorch dataset class. See `OnTheFlyFeatures` or `K2SpeechRecognitionDataset` for some examples. You can also look up the k2-fsa/icefall repo for recipes that support this.
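
A minimal sketch of the on-the-fly pattern, assuming fbank features (the path and the max_duration budget are placeholders):

```python
import torch
from lhotse import CutSet, Fbank
from lhotse.dataset import (
    K2SpeechRecognitionDataset,
    OnTheFlyFeatures,
    SimpleCutSampler,
)

cuts = CutSet.from_file("cuts_train.jsonl.gz")  # placeholder path

# Features are computed per batch inside the dataset; nothing is written to disk.
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))
sampler = SimpleCutSampler(cuts, max_duration=300.0)  # placeholder batch budget
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in dloader:
    feats = batch["inputs"]  # (num_cuts, num_frames, num_features) fbank tensor
    ...
```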

OswaldoBornemann commented 7 months ago

That's great. I will try to revise it. Thanks a lot.