Closed wgb14 closed 2 years ago
There are also commands in lhotse to split and recombine data. I forget the names/invocation but Piotr answered my question on this in either this repo or lhotse's repo.
On Wed, Nov 17, 2021 at 9:57 PM Piotr Żelasko @.***> wrote:
@.**** commented on this pull request.
In egs/gigaspeech/ASR/local/compute_fbank_gigaspeech.py https://github.com/k2-fsa/icefall/pull/120#discussion_r751264888:
as the sampler won't be able to do it later in an
efficient manner.
- cut_set = cut_set.shuffle()
- if args.precomputed_features:
Extract the features after cutting large recordings into
smaller cuts.
Note:
we support very efficient "chunked" feature reads with
the argument
storage_type=ChunkedLilcomHdf5Writer
, but we don't support efficient data augmentation and
feature computation for long recordings yet.
Therefore, we sacrifice some storage for the ability to
precompute features on shorter chunks,
without memory blow-ups.
- cut_set = cut_set.compute_and_store_features(
... actually, if you're using speed perturbation or other augmentations, ditching them might solve your issues. You can still use MUSAN, SpecAugment, etc. in Dataset later.
I forget the names/invocation but Piotr answered my question on this in either this repo or lhotse's repo.
It is in https://github.com/lhotse-speech/lhotse/issues/452#issuecomment-962402670
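The split-and-recombine idea referenced above can be sketched in plain Python (this is an illustration only; the hypothetical `split_manifest`/`combine_manifests` names stand in for lhotse's actual split/combine utilities discussed in the linked issue):

```python
def split_manifest(items, num_splits):
    """Split a list of items into num_splits contiguous chunks of
    (nearly) equal size, analogous to splitting a manifest."""
    chunk_size = -(-len(items) // num_splits)  # ceiling division
    return [items[i * chunk_size:(i + 1) * chunk_size]
            for i in range(num_splits)]

def combine_manifests(chunks):
    """Recombine the chunks back into a single list, analogous to
    combining the per-split manifests after feature extraction."""
    return [item for chunk in chunks for item in chunk]
```

Each chunk can then be processed independently (even on different machines), and the outputs merged at the end.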
Looks like there has not been much progress on the GigaSpeech recipe for more than 2 weeks.
I just made a PR https://github.com/wgb14/icefall/pull/1 to compute the features by splitting the manifests before extraction and combining them afterwards.
It seems to resolve the OOM issue. The expected time to extract the features of the XL subset is about 2 days using a single GPU, I think; using more GPUs should decrease the time roughly linearly. (Note: after speed perturbation, the XL subset contains 30k hours of data. The GPU is idle most of the time, so I think computation is not the bottleneck.)
The screenshot below compares the speed of feature extraction between CUDA and CPU on 1 of the 1000 pieces of the XL subset.
The following screenshot shows the memory consumption when extracting features of the XL subset on CUDA with --num-workers=20 --batch-duration=600.
I believe there will be no more OOM. Otherwise, we have to use a larger number of splits, e.g., 2000 instead of 1000. (Note: splitting into 100 pieces still causes OOM.)
Also, note that we don't need to limit the number of decoding threads used by ffmpeg.
diff --git a/lhotse/audio.py b/lhotse/audio.py
index 2190dc9..ca623cf 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1437,7 +1437,8 @@ def read_opus_ffmpeg(
:return: a tuple of audio samples and the sampling rate.
"""
# Construct the ffmpeg command depending on the arguments passed.
- cmd = f"ffmpeg -threads 1"
+ # cmd = f"ffmpeg -threads 1"
+ cmd = f"ffmpeg"
sampling_rate = 48000
# Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
# (-i), otherwise ffmpeg will decode everything and trim afterwards...
@@ -1452,7 +1453,8 @@ def read_opus_ffmpeg(
cmd += f" -ar {force_opus_sampling_rate}"
sampling_rate = force_opus_sampling_rate
# Read audio samples directly as float32.
- cmd += " -f f32le -threads 1 pipe:1"
+ # cmd += " -f f32le -threads 1 pipe:1"
+ cmd += " -f f32le pipe:1"
# Actual audio reading.
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
raw_audio = proc.stdout
@pzelasko Shall we revert https://github.com/lhotse-speech/lhotse/pull/481/files
Great! Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that? Sometimes when you run something in multiple processes, having it use multiple threads can be slower, depending on the mechanism it uses.
Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?
Will compare the speed with/without multiple decoding threads for ffmpeg.
This time I didn't get OOM, but got a CUDA OOM error while processing 137/1000:
RuntimeError: CUDA out of memory. Tried to allocate 8.07 GiB (GPU 0; 22.41 GiB total capacity; 12.49 GiB already allocated; 7.77 GiB free; 14.00 GiB reserved in total by PyTorch)
I believe that 22 GB of GPU memory is already more than most commonly used GPUs have, so I'll reduce the value of some params. Which one should I reduce first, --num-workers=20 or --batch-duration=600?
This time I didn't get OOM, but got a CUDA OOM error while processing 137/1000:
Please install the latest kaldifeat. There was a bug in it: it was not using chunk_size in computing features, which caused CUDA OOM for long utterances.
I just fixed it in https://github.com/csukuangfj/kaldifeat/pull/22
BTW, ./prepare.sh --stage 6 will continue the extraction from where it stopped.
I believe that 22GB of GPU memory is already higher than the average level of commonly used GPUs
Because utterances in GigaSpeech are several hours long and chunkwise extraction was disabled before by mistake.
After that fix, it should not use that much GPU memory anymore.
Regarding limiting ffmpeg threads, can we see whether it makes a difference to speed before reverting that?
Will compare the speed with/without multiple decoding threads for ffmpeg.
Even if you notice no speed difference, I want to avoid spawning num_cpu threads for every ffmpeg process. I think that might have been the reason why some of my large training jobs complained that I had exhausted system resources even though memory was fine; those jobs were spawning a lot of ffmpeg subprocesses, and I suspected it might be related.
Did you ever get this error?
ValueError: Requested more audio (42213.59s) than available (42213.5886875s)
I feel that this shouldn't be raised.
Can you checkout this Lhotse PR? It allows you to set the tolerance threshold for mismatches like these.
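The tolerance-threshold idea can be sketched as follows (a minimal illustration only; the function name and default value here are made up, not Lhotse's actual API):

```python
def check_audio_duration(requested: float, available: float,
                         tolerance: float = 0.025) -> None:
    """Raise only when the requested duration exceeds the available audio
    by more than `tolerance` seconds, so tiny metadata/decoder mismatches
    (like 42213.59s vs 42213.5886875s) pass through silently."""
    if requested - available > tolerance:
        raise ValueError(
            f"Requested more audio ({requested}s) than available ({available}s)"
        )
```

With such a check, the sub-millisecond mismatch from the traceback above would no longer abort extraction.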
I would recommend splitting the manifests into 2000 pieces, since splitting into 1000 pieces still causes OOM for some of them.
Also, we can open several terminals on one machine or several machines and do
python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 0 --stop 100 --num-workers 5
python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 100 --stop 200 --num-workers 5
CUDA_VISIBLE_DEVICES=1 python3 ./local/compute_fbank_gigaspeech_splits.py --num-splits 2000 --start 200 --stop 300 --num-workers 5
which can reduce the extraction time.
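To hand out the --start/--stop ranges evenly across terminals or machines, a small helper like this could compute them (a hypothetical convenience, not part of the recipe):

```python
def job_ranges(num_splits: int, num_jobs: int):
    """Partition [0, num_splits) into contiguous (start, stop) ranges,
    one per job, for use with --start/--stop as in the commands above."""
    base, extra = divmod(num_splits, num_jobs)
    ranges, start = [], 0
    for i in range(num_jobs):
        stop = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, stop))
        start = stop
    return ranges
```

For example, `job_ranges(2000, 4)` yields four equal ranges covering all 2000 splits with no overlap.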
Did you ever get this error?
ValueError: Requested more audio (42213.59s) than available (42213.5886875s)
I feel that this shouldn't be raised.
The following change works for me with that PR:
--- a/egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py
+++ b/egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py
@@ -28,6 +28,7 @@ from lhotse import (
KaldifeatFbank,
KaldifeatFbankConfig,
)
+from lhotse.audio import set_audio_duration_mismatch_tolerance
# Torch's multithreaded behavior needs to be disabled or
# it wastes a lot of CPU and slow things down.
@@ -80,6 +81,7 @@ def get_parser():
def compute_fbank_gigaspeech_splits(args):
+ set_audio_duration_mismatch_tolerance(0.01) # seconds
num_splits = args.num_splits
output_dir = f"data/fbank/XL_split_{num_splits}"
output_dir = Path(output_dir)
[EDITED]: Have to call it after setting up the logger.
Yes, this also works for me. But it also muted my logger, so I commented out the logging call in set_audio_duration_mismatch_tolerance.
By the way, as I checked GPU memory allocation during extraction, I can see about 22GB is allocated. Is this normal behavior? Do we release GPU after extracting each split?
But It also muted my logger, so I commented out logging in set_audio_duration_mismatch_tolerance
You have to put it after setting up the logger.
Do we release GPU after extracting each split?
No. The allocated GPU memory is cached by PyTorch, I think. You can limit GPU memory usage by changing the chunk size when creating the feature extractor.
By the way, as I checked GPU memory allocation during extraction, I can see about 22GB is allocated. Is this normal behavior? Do we release GPU after extracting each split?
I just had a CUDA OOM error:
I think we may need to empty the cached memory allocated by PyTorch after processing each split.
Since this is a PyTorch allocation error, surely it would free any cached memory if that would be helpful? (Obviously if we still hold on to a reference to that memory somehow, it's a different matter).
From the error message:
RuntimeError: CUDA out of memory. Tried to allocate 12.51 GiB (GPU 0; 31.75 GiB total capacity; 11.11 GiB already allocated; 10.74 GiB free; 19.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I think the OOM is caused by fragmentation. I will try to empty the cache after processing each split and see whether it will cause CUDA OOM again.
By the way, as I checked GPU memory allocation during extraction, I can see about 22GB is allocated. Is this normal behavior? Do we release GPU after extracting each split?
I just had a CUDA OOM error:
I think we may need to empty the cached memory allocated by PyTorch after processing each split.
I got the same error, and reducing the chunk_size doesn't help.
Could you try the following change:
@@ -123,6 +125,8 @@ def compute_fbank_gigaspeech_splits(args):
batch_duration=args.batch_duration,
storage_type=ChunkedLilcomHdf5Writer,
)
+ if device.type == 'cuda':
+ torch.cuda.empty_cache()
The memory occupation reported by nvidia-smi before and after torch.cuda.empty_cache() is given below:
If it still does not help, I would suggest emptying the cache after processing every batch.
Could you try the following change:
@@ -123,6 +125,8 @@ def compute_fbank_gigaspeech_splits(args):
         batch_duration=args.batch_duration,
         storage_type=ChunkedLilcomHdf5Writer,
     )
+    if device.type == 'cuda':
+        torch.cuda.empty_cache()
If it still does not help, I would suggest emptying the cache after processing every batch.
I tried, but got CUDA OOM at the same split where it stopped last time.
I'll try empty_cache() after each batch in compute_and_store_features_batch.
I tried this:
diff --git a/lhotse/cut.py b/lhotse/cut.py
index 9583eab..29d39cd 100644
--- a/lhotse/cut.py
+++ b/lhotse/cut.py
@@ -3838,6 +3838,8 @@ class CutSet(Serializable, Sequence[Cut]):
features = extractor.extract_batch(
waves, sampling_rate=cuts[0].sampling_rate
)
+ if extractor.device.type == 'cuda':
+ torch.cuda.empty_cache()
for cut, feat_mtx in zip(cuts, features):
if isinstance(cut, PaddingCut):
but got CUDA OOM in another batch in the same split.
Some information about the split that causes CUDA OOM for me is:
You can see that utterances in the split are 10 to 20 hours long.
I am trying to identify the utterance that is causing CUDA OOM.
How about working on the code that does GPU-based feature extraction, to add some kind of wrapper that will split and then separately compute and recombine very long recordings?
How about working on the code that does GPU-based feature extraction, to add some kind of wrapper that will split and then separately compute and recombine very long recordings?
Yes, that is how we are doing currently. Please see https://github.com/csukuangfj/kaldifeat/blob/d2652a2c493678f918eef69bbdb00ff9776b8a6c/kaldifeat/python/kaldifeat/offline_feature.py#L114-L127
assert chunk_size > 0
num_chunks = x.size(0) // chunk_size
end = 0
features = []
for i in range(num_chunks):
    start = i * chunk_size
    end = start + chunk_size
    this_chunk = self.computer.compute_features(
        x[start:end], vtln_warp
    )
    features.append(this_chunk)

if end < x.size(0):
    last_chunk = self.computer.compute_features(x[end:], vtln_warp)
    features.append(last_chunk)
features = torch.cat(features, dim=0)
The default chunk size used in lhotse is 1000. See https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/kaldifeat.py#L158
# This is an extra setting compared to kaldifeat FbankOptions:
# by default, we'll ask kaldifeat to compute the feats in chunks
# to avoid excessive memory usage.
chunk_size: Optional[int] = 1000
I think it might make more sense to transfer to CPU after computing each chunk, and concatenate on CPU. And also transfer only the chunks to CUDA. Could use a much larger chunk size in that case, perhaps.
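The suggestion above could look roughly like this schematic sketch (device-agnostic stand-ins: `compute` runs on the accelerator, `offload` moves each chunk's result off it, e.g. `tensor.to("cpu")` in the real kaldifeat code; names here are illustrative only):

```python
def chunked_extract(x, compute, chunk_size, offload):
    """Compute features chunk by chunk, offloading each chunk's result
    (e.g. GPU -> CPU) before concatenating, so that at most one chunk's
    features live on the device at any time."""
    out = []
    for start in range(0, len(x), chunk_size):
        feats = compute(x[start:start + chunk_size])  # runs on the device
        out.append(offload(feats))                    # move result off-device
    # Concatenate off-device; with torch this would be torch.cat on CPU.
    return [f for chunk in out for f in chunk]
```

Because only one chunk of features is ever resident on the GPU, a much larger chunk size becomes affordable, as suggested above.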
Thanks, will change it.
I think it might make more sense to transfer to CPU after computing each chunk, and concatenate on CPU.
See https://github.com/lhotse-speech/lhotse/pull/498 and https://github.com/csukuangfj/kaldifeat/pull/23
I think it should fix the CUDA OOM errors with kaldifeat v1.12.
Also, we don't need to empty the cache after each split. I just verified with the output of nvidia-smi: the GPU RAM consumption is rather stable, about 3.6 GB when the chunk size is 20 minutes.
I just realized that we have to disable the caching mechanism in lhotse when reading audio.
The default number of cached items is 512. For a corpus like GigaSpeech, where utterances are several hours long, that takes lots of RAM and causes OOM easily. Also, each audio file is read only once from disk, so there is no point caching it, I think.
See https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L1064-L1065
@dynamic_lru_cache
def read_audio(
https://github.com/lhotse-speech/lhotse/blob/master/lhotse/caching.py#L37
To disable/enable caching globally in Lhotse, call::
>>> from lhotse import set_caching_enabled
>>> set_caching_enabled(True) # enable
>>> set_caching_enabled(False) # disable
Currently it hard-codes the cache size at 512 items
(regardless of the array size).
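A back-of-the-envelope calculation shows why 512 cached items blow up RAM on GigaSpeech-like data (assuming 3-hour float32 recordings at 16 kHz; actual GigaSpeech durations vary, so this is only an order-of-magnitude estimate):

```python
# Rough RAM estimate for an LRU audio cache that holds whole recordings.
CACHE_ITEMS = 512            # hard-coded cache size mentioned above
SAMPLING_RATE = 16_000       # Hz
BYTES_PER_SAMPLE = 4         # float32 samples
HOURS_PER_UTTERANCE = 3      # assumed typical multi-hour recording

bytes_per_utt = HOURS_PER_UTTERANCE * 3600 * SAMPLING_RATE * BYTES_PER_SAMPLE
total_gib = CACHE_ITEMS * bytes_per_utt / 2**30
print(f"{total_gib:.0f} GiB")  # prints roughly "330 GiB" -> easy OOM
```

Hundreds of GiB of cached waveforms explains the OOM, and since each file is read only once, disabling the cache loses nothing.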
Good point! I never noticed it because I was using on-the-fly features.
Finally, I finished feature extraction, and now I'm about to start the model training.
Any suggestions on parameters? Or should I begin with the scripts in egs/librispeech/ASR/conformer_ctc/ and use the params in https://github.com/k2-fsa/icefall/blob/95af0397336ac840a5bfed1ae8de79dbddcdad71/egs/librispeech/ASR/RESULTS.md?
By the way, do we support training on multi-GPU multi-node now?
Cool! I would start with the librispeech setup for now, although the optimal setup would probably have more layers and/or have larger d_model. I believe Fangjun has experimented with multi-GPU training, although I'm not sure if we currently have support for it.
As for multi-node multi-GPU training, there is a PR about it. Please see https://github.com/k2-fsa/icefall/pull/63
The entry point is the file egs/librispeech/ASR/conformer_ctc/run-multi-node-multi-gpu.sh in that PR.
I would recommend you to use PyTorch >= 1.9.0 so that you can use a different number of GPUs on each node.
Please leave a message if you encounter any issues.
[EDITED]: Please use machines belonging to the same subnet. Otherwise, the communication overhead may dominate the training time.
I got this warning if I set --small-dev True. I think 28 is actually the number of recordings.
Also, if --world-size > 1, setup_logger() does not work, and I can only see warnings. Did you ever see this bug?
Also, if --world-size >1, setup_logger() would not work, and I can only see warnings. Did you ever see this bug
It depends on your PyTorch version. Please see https://github.com/k2-fsa/icefall/issues/35
One workaround is to do the following inside the main() function (or do it globally):
logging.info = logging.warning
I got this warning if set --small-dev True. I think 28 is actually the number of recordings.
It is caused by https://github.com/k2-fsa/icefall/blob/bea78f609445a407db2377304818da550268d79c/egs/gigaspeech/ASR/conformer_ctc/asr_datamodule.py#L375
1000 is too large for this CutSet and leads to the above warning.
logging.info = logging.warning
Thanks, I can see the log from console now.
I got this warning if set --small-dev True. I think 28 is actually the number of recordings.
It is caused by the code above: 1000 is too large for this CutSet and leads to the above warning.
I copied this part from @pzelasko's recipe in snowfall. Here I thought 1000 referred to the number of utterances, since there are about 6000 utterances in cuts_DEV, and we want to speed up validation. But now it seems that the option first=1000 returns 1000 recordings.
I got a CUDA OOM error even with --max-duration set to 100. Should I keep reducing this param? Or do we have a table of the relation between GPU memory and the max-duration value?
I got CUDA OOM error even if specified --max-duration to 100, should I continue reducing this param? Or do we have a table about the relation between GPU memory and max-duration value?
If there are very, very long utterances, e.g., several hours long, in the CutSet, it is very likely to cause OOM even if you use --max-duration 1.
https://github.com/k2-fsa/icefall/pull/120#discussion_r771685179 would help to fix the OOM issue due to long utterances.
To use a larger --max-duration in training, I would recommend using https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/local/display_manifest_statistics.py to get an idea of the duration distribution and use it to remove very long utterances from the CutSet. See
https://github.com/k2-fsa/icefall/blob/1d44da845b264b3a24fadcc6af577055d399220d/egs/librispeech/ASR/transducer/train.py#L616-L630
Caution: the threshold 20 is for LibriSpeech; you may need to change it for GigaSpeech. And you have to use trim_to_supervisions() before using display_manifest_statistics.py.
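The duration filter linked above amounts to something like this sketch (thresholds are illustrative; in lhotse itself this would be `cut_set.filter(...)`, which stays lazy, rather than building a list):

```python
def remove_short_and_long_utt(cuts, min_seconds=1.0, max_seconds=20.0):
    """Keep only cuts whose duration lies in [min_seconds, max_seconds],
    mirroring the filter in the linked librispeech transducer train.py.
    The 20 s upper bound is the LibriSpeech value and would need
    retuning for GigaSpeech."""
    return [c for c in cuts if min_seconds <= c.duration <= max_seconds]
```

Dropping the handful of multi-hour outliers is what makes a large --max-duration safe again.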
I also met a similar problem using my large dataset (more than 10000h). For feature extraction, after seeing the gigaspeech recipe, it now works with lazy cuts. I also have many subsets, of which the largest is ~1500h. I create the cuts and features separately, and finally merge them into one cuts.jsonl.gz file.
But for training, I can only make it work by setting --max-duration=50 (80 causes OOM). It's extremely slow: one epoch would need two weeks using 4 GPUs. Looking at the duration distribution, 60% of utterances are shorter than 3s (min is 0.5s) and 2% are longer than 15s (max is 20s). Maybe the valid duration per batch is small due to useless padding.
I tried two things: (1) Use BucketingSampler and set --max-duration=200, but it costs too much CPU memory; each worker reads all cuts into memory, so only --world-size=1 fits on my 256G-memory machine. (2) Reorder the cuts in advance and save them to a file: I split the cuts into buckets ([1s, 2s], [2s, 3s], ...), shuffle inside each bucket, and recombine them in order, so samples in one batch are not padded too much. I'm not sure whether this will cause convergence problems or WER degradation; it is still running. Using SingleCutSampler with --max-duration=200 and --world-size=4, the speed is 60h for one epoch.
Hope there will be a lazy BucketingSampler or a better solution. @pzelasko
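The manual reordering described in (2) could be sketched like this (an illustration of the idea, not the poster's actual script; bucket width and seed are made-up parameters):

```python
import random

def bucket_by_duration(cuts, bucket_width=1.0, seed=0):
    """Group cuts into duration buckets ([1s,2s), [2s,3s), ...), shuffle
    within each bucket, then concatenate buckets in order, so that
    consecutive cuts have similar lengths and batches waste little
    padding."""
    buckets = {}
    for c in cuts:
        buckets.setdefault(int(c.duration // bucket_width), []).append(c)
    rng = random.Random(seed)
    ordered = []
    for key in sorted(buckets):
        rng.shuffle(buckets[key])   # randomness within a bucket only
        ordered.extend(buckets[key])
    return ordered
```

The trade-off is exactly the one raised above: batches become homogeneous in length, but global shuffling is lost, which might affect convergence.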
But for training, I can only make it work by setting --max-duration=50 (80 causes OOM). It's extremely slow: one epoch would need two weeks using 4 GPUs.
It might be worth printing out some information on the failing batch to determine the utterance size and num-utterances that is failing, e.g. to see whether it's about very mismatched sizes, or one very long utterance, or many short ones. We're trying to get away from the transformer-decoder, which is a little expensive in terms of memory.
Hope there will be a lazy BucketingSampler or a better solution. @pzelasko
Check out the DynamicBucketingSampler (added in https://github.com/lhotse-speech/lhotse/pull/517): https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/sampling/dynamic_bucketing.py
I am starting to look into your other issue with growing RAM to see if I can replicate it... efficient large-scale training is something we definitely want to support.
FYI for people tracking the RAM issues here: I added the "lazy" bucketing and described how to switch to it here: https://github.com/k2-fsa/icefall/pull/120#discussion_r776461928
If you're using precomputed features stored in HDF5 files, you might still notice growing CPU RAM usage. Unfortunately I can't find a way to disable HDF5 caches (and some people tell me it might be a memory leak in HDF5 itself). I created a different storage format that should be ready to use; if somebody wants to help me in testing its stability, there's an easy way to do so:
1) check out Lhotse at branch from PR: https://github.com/lhotse-speech/lhotse/pull/522
2) copy your existing cut manifest using the command:
$ lhotse copy-feats -t lilcom_chunky <path-to-cuts.jsonl.gz> <path-to-copy.jsonl.gz> <new-feat-dir>
Example:
$ mkdir data/fbank_cpy
$ lhotse copy-feats -t lilcom_chunky data/fbank/cuts_train-other-500.jsonl.gz data/fbank_cpy/cuts_train-other-500.jsonl.gz data/fbank_cpy/train-other-500
(please note the letter "l" in the extension .jsonl.gz; the output cut set MUST be in .jsonl or .jsonl.gz format)
3) change the paths in the asr_datamodule.py script to point to the new cut set and re-run the training
Also, update results here: the best WER for GigaSpeech as of 2022-04-06 is below (using HLG decoding + n-gram LM rescoring + attention decoder rescoring):

|     | Dev   | Test  |
|-----|-------|-------|
| WER | 11.93 | 11.86 |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.3            | 1.5             |
Ready for review now.
Cool! BTW, if you have time, it might be worth trying the RNN-T setup, because it tends to shine when the amount of data is very large.