Feature extraction for 5000 hours of data

k2-fsa / icefall

https://k2-fsa.github.io/icefall/

Apache License 2.0


Feature extraction for 5000 hours of data #1558

Closed bsshruthi22 closed 2 months ago

bsshruthi22 commented 2 months ago

I am extracting features for around 5000 hours of data, using the GigaSpeech code. My training data is divided into 240043 splits. The system has the following configuration:

cores: 24
RAM: 130 GB
GPU: NVIDIA RTX A6000 with 48 GB memory

Feature extraction has been running for 2 days.

1) Does it normally take this long? Is there any way that I can optimize it? If yes, can I do that while the extraction is in progress?
2) Since extraction is still going on, if I stop it, is there a way to resume it so that it extracts only the splits that are not done yet?

Thanks in advance.

csukuangfj commented 2 months ago

https://github.com/k2-fsa/icefall/blob/4917ac8bab2e5dc0021f17249f58b7a827a83af9/egs/librispeech/ASR/local/compute_fbank_gigaspeech_splits.py#L55-L74

Is there any way that I can optimize it?

Yes, you can start several terminals and run the script in each one with a different --start and --stop range. For instance:

# Terminal 1
--start 0 --stop 100 --num-splits 240043

# Terminal 2
--start 100 --stop 200 --num-splits 240043
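The ranges for each terminal can also be generated rather than typed by hand. The following is a minimal sketch, not part of icefall: `partition_splits` is a hypothetical helper, the worker count of 8 is an arbitrary choice, and it assumes that `--stop` is exclusive, as the 0-100 / 100-200 example above suggests.

```python
from typing import List, Tuple


def partition_splits(num_splits: int, num_workers: int) -> List[Tuple[int, int]]:
    """Divide [0, num_splits) into num_workers contiguous (start, stop) ranges.

    Hypothetical helper. Assumes stop is exclusive, matching the
    0-100 / 100-200 example in this thread.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(num_splits, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional split each.
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


if __name__ == "__main__":
    # 8 terminals is an arbitrary choice for illustration.
    for worker, (start, stop) in enumerate(partition_splits(240043, 8), 1):
        print(f"# Terminal {worker}")
        print(f"--start {start} --stop {stop} --num-splits 240043")
```

How many terminals actually help depends on disk I/O and on how much GPU memory each process uses, so the count is worth increasing gradually.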

Since extraction is going on, if I stop it, is there a way to resume it and extract only the splits that are not done?

Yes, the script will skip files that are already extracted.
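To see how much work is left before restarting, the same skip-if-present idea can be applied from the outside. This is a rough sketch under stated assumptions: `pending_splits` is a hypothetical helper, and the `split_{idx}.done` file name is a placeholder, not the naming used by the icefall script; replace it with whatever output files your run actually writes.

```python
from pathlib import Path
from typing import List


def pending_splits(
    output_dir: str,
    start: int,
    stop: int,
    name_template: str = "split_{idx}.done",  # placeholder name, not icefall's
) -> List[int]:
    """Return split indices in [start, stop) that have no output file yet."""
    out = Path(output_dir)
    return [
        idx
        for idx in range(start, stop)
        if not (out / name_template.format(idx=idx)).is_file()
    ]
```

For example, `len(pending_splits("data/fbank", 0, 240043))` would give the number of splits still to be processed, where `data/fbank` is again only an assumed output directory.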

csukuangfj commented 2 months ago

1) Does it take this long?

It depends on the I/O of your disk and also the format of your data. If it is .wav, then it should not take so long for only 5k hours of data.

bsshruthi22 commented 2 months ago

We are using .wav files

bsshruthi22 commented 2 months ago

It has been two and a half days, and feature extraction is done for 160000 splits.
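From these numbers the remaining time can be estimated: 160000 splits in 2.5 days is 64000 splits per day, and 240043 - 160000 = 80043 splits are left, which is about 1.25 more days at the same pace. The sketch below does this arithmetic; `remaining_days` is a hypothetical helper, and the `num_processes` argument assumes ideal linear speed-up from running several --start/--stop ranges in parallel, which disk I/O may not allow in practice.

```python
def remaining_days(
    done: int, total: int, elapsed_days: float, num_processes: int = 1
) -> float:
    """Estimate the days left, assuming the observed rate stays constant.

    num_processes models ideal linear scaling across parallel runs,
    which is an optimistic assumption.
    """
    rate_per_day = done / elapsed_days  # splits per day for one process
    return (total - done) / (rate_per_day * num_processes)


if __name__ == "__main__":
    # Numbers reported in this thread.
    print(remaining_days(160000, 240043, 2.5))
    # Same workload split over 4 parallel processes (ideal case).
    print(remaining_days(160000, 240043, 2.5, num_processes=4))
```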