srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

I ran eesen on a gridengine cluster, but only feature extraction and decoding ran on the cluster #81

Open · zhangjiulong opened this issue 8 years ago

zhangjiulong commented 8 years ago

I see in the train_ctc_parallel.sh file that the training command is train-ctc-parallel. Does this mean eesen does not support training on the SGE cluster?

yajiemiao commented 8 years ago

You can submit the running of train_ctc_parallel.sh (for example https://github.com/srvk/eesen/blob/master/asr_egs/wsj/run_ctc_phn.sh#L75) to the scheduler. Alternatively, you can modify train_ctc_parallel.sh by following https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141
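
For concreteness, here is a minimal sketch of the first option, assuming a Kaldi-style utils/queue.pl wrapper is present in the recipe directory and that your grid exposes a GPU queue. The queue name "gpu.q", the "gpu=1" resource flag, and the exp/data paths below are illustrative assumptions, not part of the recipe:

# Submit the whole training stage as a single SGE job that reserves one GPU.
# "gpu.q", "gpu=1", and the exp/data paths are placeholders for your own setup.
utils/queue.pl -q gpu.q -l gpu=1 exp/train_phn/train.log \
  steps/train_ctc_parallel.sh --add-deltas true \
  data/train_tr95 data/train_cv05 exp/train_phn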

zhangjiulong commented 8 years ago

I followed https://github.com/srvk/eesen/blob/master/asr_egs/wsj/steps/train_ctc_parallel_h.sh#L141 and set nj to the number of my GPUs (I have an executor node with 3 GPUs, so nj was set to 3). There are 3 training processes on the executor node, but all of them used the same GPU. I ran nvidia-smi, and the result is as follows:

| NVIDIA-SMI 361.42     Driver Version: 361.42         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 0000:04:00.0     Off |                  N/A |
|  5%   61C    P8    21W / 170W |     15MiB /  4094MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 970     Off  | 0000:83:00.0     Off |                  N/A |
| 45%   61C    P8    15W / 170W |     15MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 970     Off  | 0000:84:00.0     Off |                  N/A |
| 20%   68C    P2    69W / 170W |   1026MiB /  4095MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     30469    C   train-ctc-parallel                             335MiB |
|    2     30470    C   train-ctc-parallel                             336MiB |
|    2     30473    C   train-ctc-parallel                             335MiB |
+-----------------------------------------------------------------------------+

yajiemiao commented 8 years ago

You should set it to the number of jobs (in your case just 1), not the number of GPUs. When you set it to 3, the script submits the same job three times.
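
As a side note, outside EESEN's recipe: the three duplicate jobs all landed on the same card because nothing restricted their device selection. If you ever want several genuinely independent single-GPU jobs on different cards of one node, the usual approach is to pin each process to its own device with CUDA_VISIBLE_DEVICES; the data and experiment paths below are placeholders for illustration:

# Pin each independent single-GPU run to its own card via CUDA_VISIBLE_DEVICES.
# The data and exp paths are placeholders, not part of the recipe.
CUDA_VISIBLE_DEVICES=0 steps/train_ctc_parallel.sh data/train_a data/cv_a exp/run_a &
CUDA_VISIBLE_DEVICES=1 steps/train_ctc_parallel.sh data/train_b data/cv_b exp/run_b &
wait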

zhangjiulong commented 8 years ago

Does this mean the training process can only run on one node and use only one GPU?

iurii-milovanov commented 8 years ago

+1 on @zhangjiulong's last question.

Is it still the case that Eesen doesn't support multi-GPU training? If not, what is the best way to enable it?

fmetze commented 8 years ago

We do have a multi-GPU implementation. Would some of you be available to help test it? We’ll need help in determining the best parameterization (when to average models, how many GPUs, …) unless it can be considered “stable”.

yajiemiao commented 8 years ago

EESEN's current multi-GPU implementation is the script steps/train_ctc_parallel_h.sh, which is based on naive model averaging. It is not stable yet. Several people are working on this from different angles, but there is nothing concrete to check into the repo yet.
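
As a rough illustration of what "naive model averaging" means (a conceptual sketch only, not the contents of train_ctc_parallel_h.sh): every GPU trains its own copy of the current model on its own shard of the training data, and the resulting copies are then averaged into a single model that seeds the next epoch. The argument layout of train-ctc-parallel below only mimics a generic (feats, labels, model-in, model-out) call, and "average-nnet-models" is a hypothetical command used purely to show where the averaging step sits:

# Conceptual sketch of naive model averaging across 3 GPUs (not EESEN's code).
dir=exp/train_phn; iter=1
for gpu in 0 1 2; do
  # Each copy trains on its own data shard, pinned to one card.
  CUDA_VISIBLE_DEVICES=$gpu train-ctc-parallel \
    "scp:$dir/shard$gpu.scp" "ark:$dir/labels.tr.ark" \
    $dir/nnet.iter$iter $dir/nnet.iter$iter.gpu$gpu &
done
wait
# Hypothetical averaging step: combine the per-GPU models by averaging their
# parameters into the model used for the next epoch.
average-nnet-models $dir/nnet.iter$iter.gpu0 $dir/nnet.iter$iter.gpu1 \
  $dir/nnet.iter$iter.gpu2 $dir/nnet.iter$((iter+1))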