czy97 closed this issue 2 years ago
When I run the callhome recipe using the default config file, the GPU utilization is extremely low (less than 10%). Is this normal? Could this be caused by the PIT loss calculation? I found that the code computes the PIT loss serially.
It could be related to the PIT loss calculation, but probably isn't.
The current setting (conf/train.yaml) assumes that the user has a relatively old/weak GPU with a small amount of memory. If you are using a GPU with large memory, you can decrease batchsize_per_gpu (in conf/train.yaml) and simultaneously increase batchsize (in conf/train.yaml) to put as much training data onto the GPU as you can, while always making sure that batchsize * batchsize_per_gpu remains 1024, e.g. batchsize: 512, batchsize_per_gpu: 2. By doing so, you should be able to increase GPU utilization.
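If it helps, here is a minimal sketch of how you could check that constraint against conf/train.yaml (illustrative only, not part of the recipe; it assumes PyYAML is installed and the key names above):

```python
# Illustrative sanity check for the constraint discussed above; not part of
# the recipe. Assumes PyYAML and the batchsize / batchsize_per_gpu keys in
# conf/train.yaml.
import yaml

with open("conf/train.yaml") as f:
    conf = yaml.safe_load(f)

product = conf["batchsize"] * conf["batchsize_per_gpu"]
assert product == 1024, f"batchsize * batchsize_per_gpu = {product}, expected 1024"
```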
Thanks for the comment. By the way, could you upload the log of the callhome recipe, if possible? I can't reproduce the results you listed and I want to find the reason.
In addition, I find that the speaker loss and PIT loss calculations strongly influence the training speed. I updated the calculation here; you can check it.
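Roughly, the idea is to evaluate the BCE for all speaker permutations in one batched call instead of looping over the permutations in Python. A minimal sketch of that idea, assuming PyTorch logits and 0/1 speaker-activity labels for a single chunk (simplified, not the exact code in the update):

```python
# Simplified sketch: evaluate the BCE under every speaker permutation with one
# batched call rather than a Python loop. `pred` holds logits and `label`
# holds 0/1 activity targets for a single chunk.
from itertools import permutations

import torch
import torch.nn.functional as F

def batch_pit_loss(pred: torch.Tensor, label: torch.Tensor):
    """pred, label: (T, S) tensors for one chunk with S speakers."""
    n_spk = label.shape[1]
    perms = list(permutations(range(n_spk)))                       # S! permutations
    label_perms = torch.stack([label[:, list(p)] for p in perms])  # (S!, T, S)
    pred_rep = pred.unsqueeze(0).expand_as(label_perms)            # (S!, T, S)
    # One BCE call covering all permutations, averaged over time and speakers
    losses = F.binary_cross_entropy_with_logits(
        pred_rep, label_perms, reduction="none").mean(dim=(1, 2))  # (S!,)
    min_loss, min_idx = losses.min(dim=0)
    return min_loss, perms[int(min_idx)]
```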
Thanks for the feedback. We uploaded the log files in https://github.com/nttcslab-sp/EEND-vector-clustering/blob/main/egs/callhome/v1/Log.tar.gz
Sorry to bother you again. I find that my reproduction achieves results similar to yours when the number of speakers is small, but the results get worse when there are more speakers. Can you give me some advice? Thanks.
| Spk# | Spk2 | Spk3 | Spk4 | Spk5 | Spk6 | Spk_all |
| --- | --- | --- | --- | --- | --- | --- |
| Yours | 7.96 | 11.93 | 16.38 | 21.21 | 23.10 | 12.49 |
| Mine | 7.96 | 12.69 | 19.45 | 29.01 | 32.34 | 14.13 |
Hello, can you upload the loss log for mini_librispeech?
Hi, kli017! We have an excerpt of the validation loss transition for mini_librispeech in https://github.com/nttcslab-sp/EEND-vector-clustering/blob/main/egs/mini_librispeech/v1/RESULT.md. Is this sufficient for your purpose, or do you need the entire training log? We may need a couple of days to reproduce the log (since we first need to restore the experimental conditions we used before).
@nttcslab-sp-admin Hi, thanks for the quick reply! I checked the training log in RESULT.md and found that my mean loss is much higher than yours. I trained the model for 10 epochs and the loss only decreased from 0.6628450117613139 to 0.6555626417461194, and the DERs for nspk0 and nspk1 are 48.71 and 52.89, respectively. Have you changed any parameters or anything else in the recipe?
Hi, @kli017! It looks like you are using 8 GPUs. Could you try using only one GPU, i.e., CUDA_VISIBLE_DEVICES=0, and rerun the recipe? With that many GPUs (which effectively changes the batchsize, etc.), our preset hyper-parameters are simply far from optimal, I guess.
@czy97 We previously suggested changing the batchsize to speed up your training, such that batchsize * batchsize_per_gpu remains 1024 (more strictly, batchsize * batchsize_per_gpu * num_GPUs remains 1024), but it turned out that we cannot reproduce the same/similar results that way with, e.g., batchsize: 512, batchsize_per_gpu: 2. We sometimes got a very bad result, as you did. We are looking into this issue.

In the meantime, we found that if you change chunk_size: 150 to chunk_size: 500 in conf/train.yaml and use 1 GPU for the training, you can speed up the training and obtain an OK-ish result (something like 12.98% DER for the unknown Spk# condition). But this is not a final report; we'll get back to you once we find what the real problem is and a solution for it.
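As a rough, back-of-envelope illustration of why the larger chunk_size keeps a single GPU busier (the batchsize value below is only a hypothetical example; take the real one from conf/train.yaml):

```python
# Rough illustration only. Each training example is a chunk of chunk_size
# frames, so a larger chunk_size gives every forward/backward pass more frames
# to process per batch. The batchsize here is a hypothetical example value.
batchsize = 32  # hypothetical; use the value from conf/train.yaml
for chunk_size in (150, 500):
    print(f"chunk_size={chunk_size}: {chunk_size * batchsize} frames per batch")
```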
Thanks for the reply. Looking forward to the final solution.
@nttcslab-sp-admin Yes, I was training with 8 GPUs. So the current code does not support multi-GPU training? I also tried 2 speakers without overlap for 100 epochs; the mean loss is around 0.46 and the DER is 45.19. I ran inference on the dev set, cut the audio according to the RTTM, and found the results really bad; some results even contained only 1 speaker. I checked EEND (https://github.com/hitachi-speech/EEND/issues/4), and they said they never validated on less than 100 hours of data. So I am confused about what leads to the problem: multiple GPUs, or the training set being too small?
@kli017 Well, the code supports multi-GPU training in the sense that it does not crash. But if you use N GPUs, the actual batchsize the program uses is going to be N times bigger, and in that case our preset hyper-parameters such as batchsize, warmup steps, etc. will no longer be optimal, and you will sometimes get quite bad results. Our current preset hyper-parameters assume 1-GPU training. This issue is actually related to the one czy97 raised. We are trying to find a solution and good hyper-parameter settings for the case where you use a large batchsize and multiple GPUs (to speed up training), based on the current code.
Understood. Thanks, I'll try with 1 GPU first and look forward to the solution!
We tried several multi-GPU configurations but could not find one that closely matches or surpasses our 1-GPU training results. However, since the purpose of this repository is to reproduce the CALLHOME results of our paper (which we can do with 1-GPU training in this repo), let me close this issue.
Hello, do you have any recommendations on how to speed up training on a single GPU? I am trying a larger chunk_size now. @nttcslab-sp-admin