benyang0506 opened this issue 1 year ago
Hi, benyang0506.
The training error may be caused by the software environment. I met a similar error when I reimplemented SpeechT5 pre-training in a new environment, and the problem was fixed by installing an earlier version of PyTorch, such as 1.12 or 1.10. Also, make sure that the GPU and the CUDA build support and enable FP16 computation.
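For example, a quick check along these lines (a minimal sketch of my own, not part of the SpeechT5 codebase) can confirm that the device and the installed build expose fp16:

```python
import torch

# Sanity check: is a CUDA device visible, and does it support fp16 math?
assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print(torch.__version__, torch.version.cuda)

# V100 reports (7, 0); fp16 kernels require compute capability >= (5, 3).
print("compute capability:", torch.cuda.get_device_capability(0))

# A tiny half-precision matmul; this raises if the build lacks fp16 kernels.
x = torch.randn(8, 8, device="cuda", dtype=torch.float16)
print((x @ x).dtype)  # torch.float16 on a working setup
```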
Best wishes.
Thanks for your reply! I have tried several torch versions, but they only support TensorFloat32 when using baddbmm. I also used a V100, which supports fp16 computation, since I used it in previous work. By the way, I wonder whether there is a big difference in pre-training speed between fp16 and fp32. Thanks!
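For reference, here is a minimal repro I used (my own sketch, not the exact call inside the model) to test whether fp16 baddbmm works on a given build:

```python
import torch

# baddbmm with fully fp16 inputs on the GPU; if this raises, the
# installed torch build does not support the fp16 path for this op.
b = torch.zeros(2, 3, 5, device="cuda", dtype=torch.float16)
m1 = torch.randn(2, 3, 4, device="cuda", dtype=torch.float16)
m2 = torch.randn(2, 4, 5, device="cuda", dtype=torch.float16)
out = torch.baddbmm(b, m1, m2)
print(out.dtype)  # expect torch.float16 on a working setup
```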
Hi, sorry for the late reply. I think using fp16 can improve speed and reduce memory usage while achieving similar performance. In addition, the V100 supports fp16 training. I do not have a detailed comparison of fp16 and fp32, but it is better to use fp16 for training.
Here is a comparison from PyTorch for your reference. Thanks.
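As a rough illustration (a sketch of my own, not the PyTorch comparison above; the numbers depend on the GPU, driver, and matrix sizes), you can time fp32 vs. fp16 matmuls directly:

```python
import time
import torch

def avg_matmul_time(dtype, n=4096, iters=50):
    """Average wall time of one n x n matmul at the given dtype."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):          # warm-up so lazy init is not timed
        a @ b
    torch.cuda.synchronize()    # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("fp32:", avg_matmul_time(torch.float32))
print("fp16:", avg_matmul_time(torch.float16))  # usually much faster on V100 tensor cores
```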
Hi @benyang0506
We found some approaches helpful for handling the problem you mentioned above. Specifically, these attempts are summarized as follows (a quick sanity check is sketched after the list).
1. Use conda and pip to install pytorch when using conda (e.g., miniconda) to run fairseq-train.
2. Place the USER_DIR in the directory of fairseq/examples and use it as USER_DIR. This issue is tracked at "Multi-GPU training doesn't work when --user-dir specified" #4875.
When we reimplemented the SpeechT5 SID task using torch==1.10.1+cuda113, we also encountered the same problems as you, e.g., fp16 not working. Hopefully, these attempts will be useful.
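After following these steps, a quick sanity check (my own sketch, assuming fairseq is importable in the activated environment) can confirm the installed versions:

```python
import torch
import fairseq

# Verify the environment matches the setup described above.
print("torch  :", torch.__version__)   # expect 1.10.1+cu113
print("cuda   :", torch.version.cuda)  # expect 11.3
print("fairseq:", fairseq.__version__)
print("fp16 ok:", torch.cuda.get_device_capability(0) >= (5, 3))
```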
Did you solve the problem? I have encountered it as well.
@benyang0506
Thanks for your previous reply! But now I have run into another problem: when I used fp16 to pre-train, I got an ERROR as follows. It seems that fairseq's fp16 path is not compatible with this torch version.