gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Hardware requirements to train Kaldi models #72

Open OleksandrChekmez opened 4 years ago

OleksandrChekmez commented 4 years ago

Dear Günter, it would be very helpful to know the hardware requirements, to avoid problems from insufficient RAM, disk space, or GPU memory, and to avoid wasting time trying to train on an underpowered computer.

I understand that there may be no well-defined requirements; everything depends on the corpora used, the configs, etc.

But would you mind at least sharing your hardware specs, so we can understand what was enough to build the kaldi-generic-en-tdnn_f model? And how long did it take? Thank you!

pguyot commented 4 years ago

Most of the training (time-wise) does not actually require a GPU. The GPU is only used at the very end of the process, but the script aborts too early without one. I have been using an alternate script for building French models, moving the data to another VM for the GPU part.

The process is often CPU-bound, at least on my setup, and not always optimized for multiple cores.

[Chart: CPU usage while training the French model]

Günter uses a 64 GB machine, but in my experience 16 GB or even 12 GB can prove sufficient. You need quite a lot of disk to store every clip in 16 kHz WAV format (that's about 150 GB for 1,200 hours), plus some more to handle the conversion between formats.
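The disk figure above can be sanity-checked with a back-of-the-envelope calculation, assuming 16 kHz mono 16-bit PCM (the usual Kaldi input format):

```python
# Rough disk estimate for storing clips as 16 kHz mono 16-bit WAV.
BYTES_PER_SECOND = 16000 * 2  # 16k samples/s * 2 bytes/sample, mono

def wav_gigabytes(hours):
    """Raw PCM size in GB for the given amount of audio."""
    return hours * 3600 * BYTES_PER_SECOND / 1e9

print(round(wav_gigabytes(1200)))  # ~138 GB raw; ~150 GB with headers and overhead
print(round(wav_gigabytes(200)))   # ~23 GB for the initial 200 h French corpus
```

So the quoted 150 GB for 1,200 hours is consistent with plain uncompressed PCM plus some filesystem and header overhead.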

For the initial French model (200 hours), it took me about a month, including work on transcripts, IPA entries, and French-specific adaptations. I expect about the same time for a newer model with 400 hours, as I am running it on a larger VM (4 vCPUs). Günter uses a faster box (two CPUs with 6 cores each) and reported that the English model (1,200 hours) took him 5-6 weeks.

joazoa commented 4 years ago

@pguyot Can you share your split CPU/GPU scripts? How many CPU cores and how much memory are needed per GPU? What other issues have you seen while training French models? I found that the .ipa files have missing entries, the quality flags in the transcripts seem wrong, and the CNTRL sentence import hangs. What else can I expect, and how can I help? I can use 3 machines with 28 cores each; do you have a way to split the work over multiple PCs?

@OleksandrChekmez, I trained the small German model on 50 hours of audio with 28 cores and one 1080 Ti card in ~24 hours. Memory usage was ~16 GB.

pguyot commented 4 years ago

@joazoa Sorry if my previous message was unclear about CPU/GPU requirements.

I have been renting a VM with a GPU, and I found that kaldi-run-chain.sh requires the GPU too early: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L55

The GPU is not actually used until stage 1 of train.py, which is invoked at stage 11: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L250

The rest is CPU- or I/O-bound (mostly CPU). Too many cores can be a waste of computing power, as Kaldi splits the data into jobs and some jobs can prove significantly longer than others (eventually n-1 cores sit idle waiting for a single core to finish). You can set the number of jobs printed by this line: https://github.com/gooofy/zamia-speech/blob/master/data/src/speech/kaldi-run-chain.sh#L65
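To illustrate why extra cores can go to waste: a stage only finishes when its slowest job does. A toy sketch (the durations are made up, not measured):

```python
# Toy model of Kaldi's job-based parallelism: a stage splits the data into
# n jobs that run in parallel, so the slowest job gates the whole stage.
def stage_wall_time(job_durations_min):
    """Wall-clock time of a stage = duration of the longest job."""
    return max(job_durations_min)

# Hypothetical per-job durations (minutes) for an uneven 4-way split:
jobs = [10, 12, 11, 30]
print(stage_wall_time(jobs))  # 30: three cores idle after ~12 minutes
```

With uneven splits, adding more cores beyond the number of jobs (or beyond the balance the split allows) buys nothing.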

My script is just an adaptation of kaldi-run-chain.sh that writes snapshots after every step, which allowed me to debug transcripts, IPAs, and some of the scripts.
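As a rough sketch of that snapshot idea (paths, stage names, and commands here are hypothetical, not the actual zamia-speech layout), each CPU stage could be wrapped so its output is archived for debugging or for resuming on the GPU VM:

```shell
#!/bin/sh
# Sketch: run each stage, then snapshot the experiment directory so the
# run can be inspected or resumed on another machine (e.g. the GPU VM).
set -e

WORK=exp/chain      # illustrative experiment directory
SNAP=snapshots      # where per-stage archives go
mkdir -p "$WORK" "$SNAP"

run_stage() {
    stage="$1"; shift
    "$@"                                          # run the stage's command
    tar czf "$SNAP/after-$stage.tar.gz" "$WORK"   # snapshot after the stage
}

run_stage 01-mfcc    echo "compute MFCC features"
run_stage 02-tri-ali echo "triphone alignment"
# Later: copy the last snapshot to the GPU box and resume train.py there,
# e.g. scp "$SNAP/after-02-tri-ali.tar.gz" gpu-vm: and unpack it remotely.
```

The per-stage archives are what make it possible to move the CPU-bound work and the GPU-bound tail of the pipeline to different machines.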

I've been working on the French model, which we may discuss in another thread. My patches may require a careful review for reproducibility, and I am very glad you are trying them!

Indeed, the quality flag of the transcripts is ignored, as verbatims are not stored in tokenized form in the CSVs. This may or may not be a good idea, but does it prevent you from using the standard script? What do you mean by ".ipa files have missing entries"? Many words from the verbatim entries are not in the IPA file, and pronunciations for them are generated by Sequitur instead. I tried to add as many entries as possible, especially those for which the Sequitur model generated wrong pronunciations. What do you mean by "the CNTRL sentence import hangs"? Please do not hesitate to open a ticket for this with details, and I'll look into it.

Regarding parallelization over several boxes:

joazoa commented 4 years ago

Hello,

I noticed during a test run for German that the GPU did not get used until epoch 1 of 10; I probably spent half a day debugging why my CUDA wasn't working before I just let the run go a bit longer :)

I will try to document the use of multiple GPUs, and maybe Slurm usage, once I get to that stage with the French model.

I will leave comments on everything French-related in the other ticket.