gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Hardware requirements to train Kaldi models #72

Open OleksandrChekmez opened 4 years ago

OleksandrChekmez commented 4 years ago

Dear Günter, it would be very helpful to know the hardware requirements, to avoid problems from insufficient RAM, disk space, or GPU memory, and to avoid wasting time trying to train on an underpowered computer.

I understand that there may be no well-defined requirements; everything depends on the corpora used, the configs, etc.

But would you mind at least sharing your hardware specs, so we can understand what was enough to build the kaldi-generic-en-tdnn_f model? And how long did it take? Thank you!

pguyot commented 4 years ago

Most of the training (time-wise) does not actually require a GPU. The GPU is only used at the very end of the process, but the script aborts too early without one. I have been using an alternate script for building French models, moving the data to another VM for the GPU part.

The process is often CPU-bound, at least on my setup, and not always optimized for multiple cores.

[Chart: CPU usage while training the French model]

Günter uses a 64 GB machine, but in my experience 16 GB or even 12 GB can prove sufficient. You need quite a lot of disk to store every clip in 16 kHz WAV format (that's about 150 GB for 1,200 hours), plus some more to handle the conversion between formats.
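The disk figure above can be sanity-checked with a back-of-the-envelope calculation, assuming 16 kHz mono 16-bit PCM (the usual Kaldi input format):

```python
# Rough disk estimate for storing clips as 16 kHz mono 16-bit WAV.
BYTES_PER_SECOND = 16000 * 2  # 16k samples/s * 2 bytes/sample, mono

def wav_gigabytes(hours):
    """Raw PCM size in GB for the given amount of audio."""
    return hours * 3600 * BYTES_PER_SECOND / 1e9

print(round(wav_gigabytes(1200)))  # ~138 GB raw; ~150 GB with headers and overhead
print(round(wav_gigabytes(200)))   # ~23 GB for the initial 200 h French corpus
```

So the quoted 150 GB for 1,200 hours is consistent with plain uncompressed PCM plus some filesystem and header overhead.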

For the initial French model (200 hours), it took me about a month, including work on transcripts, IPA entries, and French-specific adaptations. I expect about the same time for a newer model with 400 hours, as I am running it on a larger VM (4 vCPUs). Günter uses a faster box (two CPUs with 6 cores each) and reported that the English model (1,200 hours) took him 5-6 weeks.

joazoa commented 4 years ago

@pguyot Can you share your split CPU/GPU scripts? How many CPU cores and how much memory are needed per GPU? What other issues have you seen while training French models? I found that the .ipa files have missing entries, the quality flags in the transcripts seem wrong, and the CNTRL sentence import hangs. What else can I expect, and how can I help? I can use 3 machines with 28 cores each; do you have a way to split the work over multiple PCs?

@OleksandrChekmez, I trained the small German model on 50 hours of audio with 28 cores and one 1080 Ti card in ~24 hours. Memory usage was ~16 GB.

pguyot commented 4 years ago

@joazoa Sorry if my previous message was unclear about CPU/GPU requirements.

I have been renting a VM with a GPU, and I found that kaldi-run-chain.sh requires the GPU too early: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L55

The GPU is not actually used until stage 1 of train.py, which is invoked at stage 11: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L250

The rest is CPU- or I/O-bound (mostly CPU). Too many cores can be a waste of computing power, as Kaldi splits the data into jobs and some jobs can prove significantly longer than others (eventually n-1 cores sit idle waiting for a single core to finish). You can set the number of jobs printed by this line: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L65
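To illustrate why extra cores can go to waste: a stage only finishes when its slowest job does. A toy sketch (the durations are made up, not measured):

```python
# Toy model of Kaldi's job-based parallelism: a stage splits the data into
# n jobs that run in parallel, so the slowest job gates the whole stage.
def stage_wall_time(job_durations_min):
    """Wall-clock time of a stage = duration of the longest job."""
    return max(job_durations_min)

# Hypothetical per-job durations (minutes) for an uneven 4-way split:
jobs = [10, 12, 11, 30]
print(stage_wall_time(jobs))  # 30: three cores idle after ~12 minutes
```

With uneven splits, adding more cores beyond the number of jobs (or beyond the balance the split allows) buys nothing.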

My script is just an adaptation of kaldi-run-chain.sh that writes snapshots after every step, which allowed me to debug transcripts, IPAs, and some of the scripts.
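As a rough sketch of that snapshot idea (paths, stage names, and commands here are hypothetical, not the actual zamia-speech layout), each CPU stage could be wrapped so its output is archived for debugging or for resuming on the GPU VM:

```shell
#!/bin/sh
# Sketch: run each stage, then snapshot the experiment directory so the
# run can be inspected or resumed on another machine (e.g. the GPU VM).
set -e

WORK=exp/chain      # illustrative experiment directory
SNAP=snapshots      # where per-stage archives go
mkdir -p "$WORK" "$SNAP"

run_stage() {
    stage="$1"; shift
    "$@"                                          # run the stage's command
    tar czf "$SNAP/after-$stage.tar.gz" "$WORK"   # snapshot after the stage
}

run_stage 01-mfcc    echo "compute MFCC features"
run_stage 02-tri-ali echo "triphone alignment"
# Later: copy the last snapshot to the GPU box and resume train.py there,
# e.g. scp "$SNAP/after-02-tri-ali.tar.gz" gpu-vm: and unpack it remotely.
```

The per-stage archives are what make it possible to move the CPU-bound work and the GPU-bound tail of the pipeline to different machines.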

I've been working on the French model, which we may discuss in another thread. My patches may require a careful review for reproducibility, and I am very glad you are trying them!

Indeed, the quality flag of the transcripts is ignored, as verbatims are not stored in tokenized form in the CSVs. This may or may not be a good idea, but does it prevent you from using the standard script? What do you mean by ".ipa files have missing entries"? Many words from the verbatim entries are not in the IPA file, and pronunciations for them are generated by Sequitur instead. I tried to add as many entries as possible, especially those for which the Sequitur model generated wrong pronunciations. What do you mean by "the CNTRL sentence import hangs"? Please do not hesitate to open a ticket for this with details, and I'll look into it.

Regarding parallelization over several boxes:

joazoa commented 4 years ago

Hello,

I noticed during a test run for German that the GPU did not get used until epoch 1 of 10; I probably spent half a day debugging why my CUDA wasn't working before I just let the run go a bit longer :)

I will try to document the use of multiple GPUs, and maybe Slurm usage, once I get to that stage with the French model.

I will leave comments on everything French-related in the other ticket.