k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

GigaSpeech recipe #120

Closed wgb14 closed 2 years ago

wgb14 commented 2 years ago

Features:

TODO:

danpovey commented 2 years ago

There are also commands in lhotse to split and recombine data. I forget the names/invocation but Piotr answered my question on this in either this repo or lhotse's repo.

On Wed, Nov 17, 2021 at 9:57 PM Piotr Żelasko @.***> wrote:

@.**** commented on this pull request.

In egs/gigaspeech/ASR/local/compute_fbank_gigaspeech.py https://github.com/k2-fsa/icefall/pull/120#discussion_r751264888:

    # as the sampler won't be able to do it later in an
    # efficient manner.
    cut_set = cut_set.shuffle()
    if args.precomputed_features:
        # Extract the features after cutting large recordings into
        # smaller cuts.
        # Note:
        #   we support very efficient "chunked" feature reads with
        #   the argument storage_type=ChunkedLilcomHdf5Writer,
        #   but we don't support efficient data augmentation and
        #   feature computation for long recordings yet.
        #   Therefore, we sacrifice some storage for the ability to
        #   precompute features on shorter chunks,
        #   without memory blow-ups.
        cut_set = cut_set.compute_and_store_features(

... actually, if you're using speed perturbation or other augmentations, ditching them might solve your issues. You can still use MUSAN, SpecAugment, etc. in Dataset later.


csukuangfj commented 2 years ago

I forget the names/invocation but Piotr answered my question on this in either this repo or lhotse's repo.

It is in https://github.com/lhotse-speech/lhotse/issues/452#issuecomment-962402670

csukuangfj commented 2 years ago

It looks like there has not been much progress on the GigaSpeech recipe for more than 2 weeks.

I just made a PR https://github.com/wgb14/icefall/pull/1 to compute the features by splitting the manifests before extraction and combining them afterwards.

It seems to resolve the OOM issue. The expected time to extract the features of the XL subset is about 2 days using a single GPU, I think. If you use more GPUs, the time should decrease roughly linearly. (Note: After speed perturbation, the XL subset contains 30k hours of data. The GPU is idle most of the time, so I think computation is not the bottleneck.)
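
Roughly, the workflow is: split the big manifest into pieces, extract features for each piece independently, and combine the per-piece manifests at the end. A minimal sketch of that idea (not the exact code from the PR; the paths and the split count are only illustrative):

# Sketch: split -> extract per piece -> combine. Paths are hypothetical.
from lhotse import CutSet, combine

cuts = CutSet.from_file("data/manifests/cuts_XL_raw.jsonl.gz")
for i, piece in enumerate(cuts.split(1000)):
    piece.to_file(f"data/fbank/cuts_XL_raw.{i}.jsonl.gz")

# ... run feature extraction on each piece separately (possibly on several GPUs) ...

extracted = [CutSet.from_file(f"data/fbank/cuts_XL.{i}.jsonl.gz") for i in range(1000)]
cuts_XL = combine(*extracted)
cuts_XL.to_file("data/fbank/cuts_XL.jsonl.gz")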

csukuangfj commented 2 years ago

The screenshot below compares the speed of feature extraction between CUDA and CPU on 1 of the 1000 pieces of the XL subset.

[screenshot]

csukuangfj commented 2 years ago

The following screenshot shows the memory consumption when extracting features of the XL subset on CUDA with --num-workers=20 --batch-duration=600.

I believe there will be no OOM anymore. Otherwise, we have to use a larger number of splits, e.g., 2000 instead of 1000. (Note: Splitting into only 100 pieces still causes OOM.)


[screenshot]

csukuangfj commented 2 years ago

Also, note that we don't need to limit the number of decoding threads used by ffmpeg.

diff --git a/lhotse/audio.py b/lhotse/audio.py
index 2190dc9..ca623cf 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1437,7 +1437,8 @@ def read_opus_ffmpeg(
     :return: a tuple of audio samples and the sampling rate.
     """
     # Construct the ffmpeg command depending on the arguments passed.
-    cmd = f"ffmpeg -threads 1"
+    #  cmd = f"ffmpeg -threads 1"
+    cmd = f"ffmpeg"
     sampling_rate = 48000
     # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
     #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
@@ -1452,7 +1453,8 @@ def read_opus_ffmpeg(
         cmd += f" -ar {force_opus_sampling_rate}"
         sampling_rate = force_opus_sampling_rate
     # Read audio samples directly as float32.
-    cmd += " -f f32le -threads 1 pipe:1"
+    #  cmd += " -f f32le -threads 1 pipe:1"
+    cmd += " -f f32le pipe:1"
     # Actual audio reading.
     proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
     raw_audio = proc.stdout

@pzelasko Shall we revert https://github.com/lhotse-speech/lhotse/pull/481/files

danpovey commented 2 years ago

Great! Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that? Sometimes when you run something in multiple processes, having it use multiple threads can be slower, depending on the mechanism it uses.

csukuangfj commented 2 years ago

Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?

Will compare the speed with/without multiple decoding threads for ffmpeg.

wgb14 commented 2 years ago

This time I didn't get a CPU OOM, but I got a CUDA OOM error while processing split 137/1000:

RuntimeError: CUDA out of memory. Tried to allocate 8.07 GiB (GPU 0; 22.41 GiB total capacity; 12.49 GiB already allocated; 7.77 GiB free; 14.00 GiB reserved in total by PyTorch)

I believe that 22 GB of GPU memory is already more than most commonly used GPUs have, so I'll reduce the value of some params. Which one should I reduce, --num-workers=20 or --batch-duration=600?

csukuangfj commented 2 years ago

This time I didn't get a CPU OOM, but I got a CUDA OOM error while processing split 137/1000:

Please install the latest kaldifeat. There was a bug in it: it was not using chunk_size when computing features, which caused CUDA OOM for long utterances.

I just fixed it in https://github.com/csukuangfj/kaldifeat/pull/22


BTW, ./prepare.sh --stage 6 will continue the extraction from where it stopped.

csukuangfj commented 2 years ago

I believe that 22 GB of GPU memory is already more than most commonly used GPUs have

That is because utterances in GigaSpeech are several hours long and chunkwise extraction was previously disabled by mistake.

After that fix, it should not use that much GPU memory anymore.

pzelasko commented 2 years ago

Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?

Will compare the speed with/without multiple decoding threads for ffmpeg.

Even if you notice no speed difference, I want to avoid spawning num_cpu threads for every ffmpeg process. I think it might have been the reason why some of my large training jobs complained that I had exhausted system resources even though memory was fine. Those jobs were spawning a lot of ffmpeg subprocesses, and I suspected it might be related to that.

wgb14 commented 2 years ago

Did you ever get this error?

ValueError: Requested more audio (42213.59s) than available (42213.5886875s)

I feel that this shouldn't be raised.

pzelasko commented 2 years ago

Can you check out this Lhotse PR? It allows you to set a tolerance threshold for mismatches like these.

https://github.com/lhotse-speech/lhotse/pull/491

csukuangfj commented 2 years ago

I would recommend splitting the manifests into 2000 pieces, since splitting into 1000 pieces still causes OOM for some of them.

Also, we can open several terminals on one machine or several machines and do

python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 0 --stop 100 --num-workers 5
python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 100 --stop 200 --num-workers 5
CUDA_VISIBLE_DEVICES=1 python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 200 --stop 300 --num-workers 5

which can reduce the extraction time.

See https://github.com/wgb14/icefall/pull/2

csukuangfj commented 2 years ago

Did you ever get this error?

ValueError: Requested more audio (42213.59s) than available (42213.5886875s)

I feel that this shouldn't be raised.

The following change works for me with that PR:

--- a/egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py
+++ b/egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py
@@ -28,6 +28,7 @@ from lhotse import (
     KaldifeatFbank,
     KaldifeatFbankConfig,
 )
+from lhotse.audio import set_audio_duration_mismatch_tolerance

 # Torch's multithreaded behavior needs to be disabled or
 # it wastes a lot of CPU and slow things down.
@@ -80,6 +81,7 @@ def get_parser():

 def compute_fbank_gigaspeech_splits(args):
+    set_audio_duration_mismatch_tolerance(0.01)  # seconds
     num_splits = args.num_splits
     output_dir = f"data/fbank/XL_split_{num_splits}"
     output_dir = Path(output_dir)

[EDITED]: Have to call it after setting up the logger.

wgb14 commented 2 years ago

Yes, this also works for me. But it also muted my logger, so I commented out the logging call in set_audio_duration_mismatch_tolerance.

wgb14 commented 2 years ago

By the way, when I checked GPU memory allocation during extraction, I could see that about 22 GB was allocated. Is this normal behavior? Do we release the GPU memory after extracting each split?

csukuangfj commented 2 years ago

But it also muted my logger, so I commented out the logging call in set_audio_duration_mismatch_tolerance.

You have to put it after setting up the logger.
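
For example, a sketch of the intended order (the log path here is made up):

from icefall.utils import setup_logger
from lhotse.audio import set_audio_duration_mismatch_tolerance

# Configure logging first; set_audio_duration_mismatch_tolerance() logs a message,
# and if it runs before setup_logger() it can end up configuring the root logger
# itself, which is presumably what silenced the log output above.
setup_logger("data/fbank/log/log-compute-fbank-xl")
set_audio_duration_mismatch_tolerance(0.01)  # seconds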

csukuangfj commented 2 years ago

Do we release the GPU memory after extracting each split?

No. The allocated GPU memory is cached by PyTorch, I think. You can limit the GPU memory usage by changing the chunk size when creating the feature extractor.
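
For example, something along these lines when creating the extractor (the chunk_size value is only illustrative; per the lhotse snippet quoted later in this thread, the default is 1000 frames):

from lhotse import KaldifeatFbank, KaldifeatFbankConfig

# A smaller chunk_size (in frames) lowers the peak GPU memory at some speed cost.
extractor = KaldifeatFbank(KaldifeatFbankConfig(device="cuda", chunk_size=500))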

csukuangfj commented 2 years ago

By the way, when I checked GPU memory allocation during extraction, I could see that about 22 GB was allocated. Is this normal behavior? Do we release the GPU memory after extracting each split?

I just had a CUDA OOM error: [screenshot]

I think we may need to empty the cached memory allocated by PyTorch after processing each split.

danpovey commented 2 years ago

Since this is a PyTorch allocation error, surely it would free any cached memory if that would be helpful? (Obviously if we still hold on to a reference to that memory somehow, it's a different matter).

csukuangfj commented 2 years ago

From the error message:

RuntimeError: CUDA out of memory. Tried to allocate 12.51 GiB (GPU 0; 31.75 GiB total capacity; 11.11 GiB already allocated; 10.74 GiB free; 19.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I think the OOM is caused by fragmentation. I will try to empty the cache after processing each split and see whether it will cause CUDA OOM again.

wgb14 commented 2 years ago

By the way, when I checked GPU memory allocation during extraction, I could see that about 22 GB was allocated. Is this normal behavior? Do we release the GPU memory after extracting each split?

I just had a CUDA OOM error: [screenshot]

I think we may need to empty the cached memory allocated by PyTorch after processing each split.

I got the same error, and reducing the chunk_size doesn't help.

csukuangfj commented 2 years ago

Could you try the following change:

@@ -123,6 +125,8 @@ def compute_fbank_gigaspeech_splits(args):
             batch_duration=args.batch_duration,
             storage_type=ChunkedLilcomHdf5Writer,
         )
+        if device.type == 'cuda':
+            torch.cuda.empty_cache()

The memory usage reported by nvidia-smi before and after torch.cuda.empty_cache() is shown below:

[screenshots]

If it still does not help, I would suggest emptying the cache after processing every batch.

wgb14 commented 2 years ago

Could you try the following change:

@@ -123,6 +125,8 @@ def compute_fbank_gigaspeech_splits(args):
             batch_duration=args.batch_duration,
             storage_type=ChunkedLilcomHdf5Writer,
         )
+        if device.type == 'cuda':
+            torch.cuda.empty_cache()

If it still does not help, I would suggest emptying the cache after processing every batch.

I tried, but got CUDA OOM at the same split where it stopped last time. I'll try empty_cache() after each batch in compute_and_store_features_batch.

wgb14 commented 2 years ago

I tried this:

diff --git a/lhotse/cut.py b/lhotse/cut.py
index 9583eab..29d39cd 100644
--- a/lhotse/cut.py
+++ b/lhotse/cut.py
@@ -3838,6 +3838,8 @@ class CutSet(Serializable, Sequence[Cut]):
                     features = extractor.extract_batch(
                         waves, sampling_rate=cuts[0].sampling_rate
                     )
+                    if extractor.device.type == 'cuda':
+                        torch.cuda.empty_cache()

                 for cut, feat_mtx in zip(cuts, features):
                     if isinstance(cut, PaddingCut):

but got CUDA OOM in another batch in the same split.

csukuangfj commented 2 years ago

Some information about the split that causes CUDA OOM for me is shown below: [screenshot]

You can see that utterances in the split are 10 to 20 hours long.

I am trying to identify the utterance that is causing CUDA OOM.

danpovey commented 2 years ago

How about working on the code that does GPU-based feature extraction, to add some kind of wrapper that will split and then separately compute and recombine very long recordings?

csukuangfj commented 2 years ago

How about working on the code that does GPU-based feature extraction, to add some kind of wrapper that will split and then separately compute and recombine very long recordings?

Yes, that is what we are doing currently. Please see https://github.com/csukuangfj/kaldifeat/blob/d2652a2c493678f918eef69bbdb00ff9776b8a6c/kaldifeat/python/kaldifeat/offline_feature.py#L114-L127

            assert chunk_size > 0
            num_chunks = x.size(0) // chunk_size
            end = 0
            features = []
            for i in range(num_chunks):
                start = i * chunk_size
                end = start + chunk_size
                this_chunk = self.computer.compute_features(
                    x[start:end], vtln_warp
                )
                features.append(this_chunk)
            if end < x.size(0):
                last_chunk = self.computer.compute_features(x[end:], vtln_warp)
                features.append(last_chunk)
            features = torch.cat(features, dim=0)

The default chunk size used in lhotse is 1000. See https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/kaldifeat.py#L158

    # This is an extra setting compared to kaldifeat FbankOptions:
    # by default, we'll ask kaldifeat to compute the feats in chunks
    # to avoid excessive memory usage.
    chunk_size: Optional[int] = 1000
danpovey commented 2 years ago

I think it might make more sense to transfer to CPU after computing each chunk, and concatenate on CPU. And also transfer only the chunks to CUDA. Could use a much larger chunk size in that case, perhaps.
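
Something like this sketch of the idea (a hypothetical helper; computer stands for the kaldifeat computer in the snippet quoted above, with the vtln_warp argument omitted for brevity):

import torch

def compute_features_chunked(computer, x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    # Keep only one chunk on the GPU at a time: move the chunk to CUDA,
    # compute its features, and immediately move the result back to the CPU.
    features = []
    end = 0
    num_chunks = x.size(0) // chunk_size
    for i in range(num_chunks):
        start = i * chunk_size
        end = start + chunk_size
        chunk = x[start:end].to("cuda")
        features.append(computer.compute_features(chunk).cpu())
    if end < x.size(0):
        features.append(computer.compute_features(x[end:].to("cuda")).cpu())
    return torch.cat(features, dim=0)  # concatenation happens on the CPU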

csukuangfj commented 2 years ago

Thanks, will change it.

csukuangfj commented 2 years ago

I think it might make more sense to transfer to CPU after computing each chunk, and concatenate on CPU.

See https://github.com/lhotse-speech/lhotse/pull/498 and https://github.com/csukuangfj/kaldifeat/pull/23

I think it should fix the CUDA OOM errors with kaldifeat v1.12.

csukuangfj commented 2 years ago

Also, we don't need to empty the cache after each split. I just verified with the output of nvidia-smi that the GPU RAM consumption is rather stable: about 3.6 GB when the chunk size is 20 minutes.

csukuangfj commented 2 years ago

I just realized that we have to disable the caching mechanism in lhotse when reading audio.

The default size of the cache is 512 items. For a corpus like GigaSpeech, where utterances are several hours long, that takes a lot of RAM and easily causes OOM. Also, each audio file is read from disk only once, so there is no point in caching it, I think.

See https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L1064-L1065

@dynamic_lru_cache
def read_audio(

https://github.com/lhotse-speech/lhotse/blob/master/lhotse/caching.py#L37

    To disable/enable caching globally in Lhotse, call::
        >>> from lhotse import set_caching_enabled
        >>> set_caching_enabled(True)   # enable
        >>> set_caching_enabled(False)  # disable
    Currently it hard-codes the cache size at 512 items
    (regardless of the array size).
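
So in the feature-extraction script, the workaround can be as simple as (sketch; where exactly to call it is up to the script):

from lhotse import set_caching_enabled

# Each audio file is decoded only once during extraction, so caching the decoded
# audio (up to 512 items by default) only wastes RAM on hours-long recordings.
set_caching_enabled(False)
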
pzelasko commented 2 years ago

Good point! I never noticed it because I was using on-the-fly features.

wgb14 commented 2 years ago

Finally, I finished feature extraction, and now I'm about to start model training. Any suggestions on parameters? Otherwise I'll begin with the scripts in egs/librispeech/ASR/conformer_ctc/ and use the params in https://github.com/k2-fsa/icefall/blob/95af0397336ac840a5bfed1ae8de79dbddcdad71/egs/librispeech/ASR/RESULTS.md. By the way, do we support multi-node multi-GPU training now?

danpovey commented 2 years ago

Cool! I would start with the librispeech setup for now, although the optimal setup would probably have more layers and/or a larger d_model. I believe Fangjun has experimented with multi-GPU training, although I'm not sure if we currently have support for it.

csukuangfj commented 2 years ago

As for multi-node multi-GPU training, there is a PR about it. Please see https://github.com/k2-fsa/icefall/pull/63

The entrypoint is the file egs/librispeech/ASR/conformer_ctc/run-multi-node-multi-gpu.sh in that PR.

I would recommend using PyTorch >= 1.9.0 so that you can use a different number of GPUs on each node.

Please leave a message if you encounter any issues.


[EDITED]: Please use machines belonging to the same subnet. Otherwise, the communication overhead may dominate the training time.

wgb14 commented 2 years ago

I got this warning if I set --small-dev True. I think 28 is actually the number of recordings. [screenshot]

Also, if --world-size > 1, setup_logger() does not work and I can only see warnings. Have you ever seen this bug?

csukuangfj commented 2 years ago

Also, if --world-size > 1, setup_logger() does not work and I can only see warnings. Have you ever seen this bug?

It depends on your PyTorch version. Please see https://github.com/k2-fsa/icefall/issues/35

One workaround is to do the following inside the main() function (or do it globally)

logging.info = logging.warning
csukuangfj commented 2 years ago

I got this warning if I set --small-dev True. I think 28 is actually the number of recordings.

It is caused by https://github.com/k2-fsa/icefall/blob/bea78f609445a407db2377304818da550268d79c/egs/gigaspeech/ASR/conformer_ctc/asr_datamodule.py#L375

1000 is too large for this CutSet and leads to the above warning.

wgb14 commented 2 years ago
logging.info = logging.warning

Thanks, I can see the log in the console now.

I got this warning if I set --small-dev True. I think 28 is actually the number of recordings.

It is caused by

https://github.com/k2-fsa/icefall/blob/bea78f609445a407db2377304818da550268d79c/egs/gigaspeech/ASR/conformer_ctc/asr_datamodule.py#L375

1000 is too large for this CutSet and leads to the above warning.

I copied this part from @pzelasko's recipe in snowfall. I thought 1000 referred to the number of utterances, since there are about 6000 utterance lines in cuts_DEV and we want to speed up validation. But now it seems that the option first=1000 actually returns the first 1000 recordings.

I got a CUDA OOM error even with --max-duration set to 100. Should I keep reducing this param? Or do we have a table of the relation between GPU memory and the max-duration value?

csukuangfj commented 2 years ago

I got a CUDA OOM error even with --max-duration set to 100. Should I keep reducing this param? Or do we have a table of the relation between GPU memory and the max-duration value?

If there are very, very long utterances in the CutSet, e.g., several hours long, it is very likely to cause OOM even if you use --max-duration 1.

https://github.com/k2-fsa/icefall/pull/120#discussion_r771685179 would help to fix the OOM issue due to long utterances.


To use a larger --max-duration in training, I would recommend using https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/local/display_manifest_statistics.py to get an idea of the duration distribution and then remove very long utterances from the CutSet; see https://github.com/k2-fsa/icefall/blob/1d44da845b264b3a24fadcc6af577055d399220d/egs/librispeech/ASR/transducer/train.py#L616-L630

Caution: The threshold 20 is for LibriSpeech. You may need to change it for GigaSpeech. And you have to use trim_to_supervisions() before using display_manifest_statistics.py.
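
A sketch of that filtering step, modeled on the linked train.py (the 20 s upper bound is the LibriSpeech choice mentioned above; the 1 s lower bound here is only illustrative):

def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration is in a reasonable range; anything much
    # longer than the training --max-duration will otherwise cause OOM.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)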

tz301 commented 2 years ago

I got a CUDA OOM error even with --max-duration set to 100. Should I keep reducing this param? Or do we have a table of the relation between GPU memory and the max-duration value?

If there are very, very long utterances in the CutSet, e.g., several hours long, it is very likely to cause OOM even if you use --max-duration 1.

#120 (comment) would help to fix the OOM issue due to long utterances.

To use a larger --max-duration in training, I would recommend using https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/local/display_manifest_statistics.py to get an idea of the duration distribution and then remove very long utterances from the CutSet; see

https://github.com/k2-fsa/icefall/blob/1d44da845b264b3a24fadcc6af577055d399220d/egs/librispeech/ASR/transducer/train.py#L616-L630

Caution: The threshold 20 is for LibriSpeech. You may need to change it for GigaSpeech. And you have to use trim_to_supervisions() before using display_manifest_statistics.py.

I also have a similar problem with my large dataset (more than 10000 hours). For feature extraction, after looking at the GigaSpeech recipe, it is now OK using lazy cuts. I also have many subsets, the largest of which is ~1500 hours. I create the cuts and features separately and finally merge them into one cuts.jsonl.gz file.

But for training, I can only make it work by setting --max-duration=50 (80 causes OOM). It is extremely slow: one epoch will take two weeks using 4 GPUs. Looking at the duration distribution, 60% of utterances are less than 3s (the minimum is 0.5s) and 2% are longer than 15s (the maximum is 20s). Maybe the valid duration per batch is so small because of useless padding.

I tried a few things: (1) Use BucketingSampler and set --max-duration=200, but it costs too much CPU memory, since each worker reads all cuts into memory; only --world-size=1 can be supported on my 256 GB machine. (2) Reorder the cuts in advance and save them to a file: I split the cuts into buckets ([1s, 2s], [2s, 3s], ...), shuffle inside each bucket, and recombine them in order, so samples in one batch are not padded too much. I am not sure whether this will cause convergence problems or WER degradation; it is still running. Using SingleCutSampler with --max-duration=200 and --world-size=4, the speed is 60 hours per epoch.

I hope there will be a lazy BucketingSampler or a better solution. @pzelasko

danpovey commented 2 years ago

But for training, I can only make it work by setting --max-duration=50 (80 causes OOM). It is extremely slow: one epoch will take two weeks using 4 GPUs.

It might be worth printing out some information on the failing batch to determine the utterance sizes and the number of utterances in the batch that fails, e.g. to see whether the problem is very mismatched sizes, one very long utterance, or many short ones. We're trying to get away from the transformer decoder, which is a little expensive in terms of memory.

pzelasko commented 2 years ago

I tried a few things: (1) Use BucketingSampler and set --max-duration=200, but it costs too much CPU memory, since each worker reads all cuts into memory; only --world-size=1 can be supported on my 256 GB machine. (2) Reorder the cuts in advance and save them to a file: I split the cuts into buckets ([1s, 2s], [2s, 3s], ...), shuffle inside each bucket, and recombine them in order, so samples in one batch are not padded too much. I am not sure whether this will cause convergence problems or WER degradation; it is still running. Using SingleCutSampler with --max-duration=200 and --world-size=4, the speed is 60 hours per epoch.

I hope there will be a lazy BucketingSampler or a better solution. @pzelasko

Check out the DynamicBucketingSampler here (added in https://github.com/lhotse-speech/lhotse/pull/517): https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/sampling/dynamic_bucketing.py
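
A minimal sketch of switching to it (the manifest path and the parameter values are only illustrative):

from lhotse import CutSet
from lhotse.dataset.sampling.dynamic_bucketing import DynamicBucketingSampler

# Open the manifest lazily so the full CutSet never sits in memory at once.
cuts = CutSet.from_jsonl_lazy("data/fbank/cuts_XL.jsonl.gz")
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=200.0,  # seconds of audio per batch
    num_buckets=30,
    shuffle=True,
)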

I am starting to look into your other issue with growing RAM to see if I can replicate it... efficient large-scale training is something we definitely want to support.

pzelasko commented 2 years ago

FYI for people tracking the RAM issues here: I added the "lazy" bucketing and described how to switch to it here: https://github.com/k2-fsa/icefall/pull/120#discussion_r776461928

If you're using precomputed features stored in HDF5 files, you might still notice growing CPU RAM usage. Unfortunately I can't find a way to disable HDF5 caches (and some people tell me it might be a memory leak in HDF5 itself). I created a different storage format that should be ready to use; if somebody wants to help me in testing its stability, there's an easy way to do so:

1) check out Lhotse at branch from PR: https://github.com/lhotse-speech/lhotse/pull/522

2) copy your existing cut manifest using the command:

$ lhotse copy-feats -t lilcom_chunky <path-to-cuts.jsonl.gz> <path-to-copy.jsonl.gz> <new-feat-dir>

Example:

$ mkdir data/fbank_cpy
$ lhotse copy-feats -t lilcom_chunky data/fbank/cuts_train-other-500.jsonl.gz data/fbank_cpy/cuts_train-other-500.jsonl.gz data/fbank_cpy/train-other-500

(please notice the "l" letter in extension .jsonl.gz; for the output cut set it MUST be in .jsonl or .jsonl.gz format)

3) change paths in the asr_datamodule.py script to the new cutset and re-run the training

wgb14 commented 2 years ago

Also updating the results here: the best WER for GigaSpeech, as of 2022-04-06, is shown below (using HLG decoding + n-gram LM rescoring + attention decoder rescoring):

         Dev     Test
WER      11.93   11.86

The scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

ngram_lm_scale   attention_scale
0.3              1.5

Ready for review now.

danpovey commented 2 years ago

Cool! BTW, if you have time, it might be worth trying the RNN-T setup, because it tends to shine when the amount of data is very large.