sony / ai-research-code


【NVC-Net】How many epochs does the model need to converge? #41

Closed Charlottecuc closed 2 years ago

Charlottecuc commented 2 years ago

e.g., for the VCTK dataset.

Besides, have you tested whether the model is robust to noisy source files at inference time (e.g., recordings made with a mobile phone, with air-conditioning noise or heavy breathing in the background, which is quite common in real-life applications)?
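For context, the kind of degradation I have in mind could be simulated with something like the sketch below (plain NumPy, white noise mixed in at a chosen SNR; the add_noise helper and the test tone are purely illustrative, not part of NVC-Net):

```python
# Hypothetical helper (not part of NVC-Net): mix white noise into a clean
# source waveform at a target SNR before feeding it to the converter.
import numpy as np

def add_noise(clean: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Return `clean` mixed with white noise at the given SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return (clean + scale * noise).astype(clean.dtype)

# Example: a 1-second 220 Hz tone at 22.05 kHz, degraded to 10 dB SNR.
sr = 22050
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
noisy = add_noise(clean, snr_db=10.0)
```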

Thank you very much

bacnguyencong-sony commented 2 years ago

Yes, it would be interesting to see this. However, we haven't tested the model on noisy audio.

On the VCTK dataset, we trained for 400 epochs.

Charlottecuc commented 2 years ago

@TE-BacNguyenCong Hi, would it be possible to share the losses of your model (at around 400 epochs) with us? Thank you very much

Charlottecuc commented 2 years ago

@TE-BacNguyenCong Besides, I'm using 8 V100 cards to train the default model, but GPU utilization is quite low (8% on average, about 4 or 5 hours per epoch). Have you also encountered this problem?

Charlottecuc commented 2 years ago

@TE-BacNguyenCong Is setting with_memory_cache and with_file_cache to True a good idea to speed up the training process?

bacnguyencong-sony commented 2 years ago

> Besides, I'm using 8 V100 cards to train the default model, but GPU utilization is quite low (8% on average, about 4 or 5 hours per epoch). Have you also encountered this problem?

This is strange. We used 4 V100 GPUs and training took around 15 minutes per epoch. I guess the overhead could be I/O operations (reading files, etc.).

bacnguyencong-sony commented 2 years ago

> Is setting with_memory_cache and with_file_cache to True a good idea to speed up the training process?

No, because the inputs are segments randomly sampled at every iteration, and we don't want to see the same segments all the time.
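To illustrate the point, here is a minimal sketch (not the actual NVC-Net dataloader) of a random-segment iterator built on nnabla's data_iterator_simple with a toy in-memory dataset; with the caches enabled, the first crop drawn for each index would be reused instead of being resampled:

```python
# Minimal sketch (not the actual NVC-Net pipeline), assuming nnabla's
# data_iterator_simple and a toy in-memory dataset of waveforms.
import numpy as np
from nnabla.utils.data_iterator import data_iterator_simple

rng = np.random.default_rng(0)
# Toy dataset: 32 variable-length waveforms.
waveforms = [rng.standard_normal(rng.integers(40000, 80000)).astype(np.float32)
             for _ in range(32)]
segment_length = 32768

def load_func(index):
    # A *new* random crop is drawn on every call, so caching the result
    # (with_memory_cache / with_file_cache) would freeze the first crop.
    x = waveforms[index]
    start = rng.integers(0, len(x) - segment_length + 1)
    return (x[start:start + segment_length],)

it = data_iterator_simple(load_func, len(waveforms), batch_size=4,
                          shuffle=True,
                          with_memory_cache=False,
                          with_file_cache=False)

batch, = it.next()  # shape: (4, segment_length), resampled every iteration
```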

Charlottecuc commented 2 years ago

> Besides, I'm using 8 V100 cards to train the default model, but GPU utilization is quite low (8% on average, about 4 or 5 hours per epoch). Have you also encountered this problem?

> This is strange. We used 4 V100 GPUs and training took around 15 minutes per epoch. I guess the overhead could be I/O operations (reading files, etc.).

I checked and found that the average data-loading time is about 0.001 s, but the backward passes are time-consuming:

Average time per batch (batch size 4, V100 16 GB):

dataloading_time: 0.00104
train_discriminator_forward: 2.99111
train_discriminator_backward: 1.34156
train_generator_forward: 5.534698
train_generator_backward: 10.752684
total_average_time_per_batch: 20.924966

Could you give any suggestions on how to increase the training speed? Thank you very much @TE-BacNguyenCong
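For reference, per-phase numbers like these can be collected with a simple wall-clock timer along the lines of the sketch below; the phase names and the wrapped calls are illustrative only, not the repository's actual training loop:

```python
# Rough sketch: accumulate wall-clock time per named phase and average it.
import time
from collections import defaultdict

class PhaseTimer:
    def __init__(self):
        self.totals = defaultdict(float)
        self.batches = 0

    def run(self, name, fn, *args, **kwargs):
        """Run fn, add its wall-clock time to `name`, and return its result."""
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - t0
        return out

    def report(self):
        """Average seconds per batch for each phase."""
        return {name: total / max(self.batches, 1)
                for name, total in self.totals.items()}

timer = PhaseTimer()
# Inside the training loop one would wrap each phase, e.g.:
#   batch = timer.run("dataloading_time", iterator.next)
#   timer.run("train_generator_forward", gen_loss.forward)
#   timer.run("train_generator_backward", gen_loss.backward)
#   timer.batches += 1
# and print timer.report() every so often.
```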

Charlottecuc commented 2 years ago

I also tested the speed on eight new 32 GB V100 cards (batch size 10, default NVC-Net code, default VCTK dataset, default nnabla CUDA Docker image). The average training speed reached:

dataloading_time: 0.0016777515411376953
train_discriminator_forward: 1.6020491123199463
train_discriminator_backward: 0.7688419818878174
train_generator_forward: 2.8338851928710938
train_generator_backward: 6.0216124057769775
total_average_time_per_batch: 12.376032829284668

It seems that the speed is still very slow.

Charlottecuc commented 2 years ago

Solved after upgrading the driver.

Shayne-Ada commented 1 year ago

@Charlottecuc I ran into the same problem, about 4 or 5 hours per epoch. Could you tell me how you solved it? My CUDA version is 11.0.

yt605155624 commented 1 year ago

@Charlottecuc Have you trained this model? Does it reproduce the results of the demo page?