Closed meanna closed 3 years ago
Please take a look at my colab notebook https://colab.research.google.com/drive/14T3BGfHOReG6GXtD7xFqoidVZPQW2iIK?usp=sharing
Hello @meanna,
Thank you for your interest in our English-Thai Machine Translation dataset.
Regarding the first issue, __"RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'"__, this is due to the first argument of the fairseq training script `./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh`. The first argument indicates the GPU that will be used for model training (e.g. 0 for the first GPU, 1 for the second one). On a machine with 1 GPU (e.g. Google Colab), this issue can be resolved by changing the first argument from 7 to 0. Then fairseq will choose the first GPU of the machine to train the model.
from
!bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 7 ./dataset/binarized/scb-mt-en-th-2020/en-th/spm-spm/32000-joined/ scb-mt-en-th-2020/en-th/spm-spm/32000-joined 9750 150
to
!bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 0 ./dataset/binarized/scb-mt-en-th-2020/en-th/spm-spm/32000-joined/ scb-mt-en-th-2020/en-th/spm-spm/32000-joined 9750 150
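Before launching the script, it can help to confirm that the GPU index you pass actually exists on the machine. The helper below is only a minimal sketch: `check_gpu_index` is a hypothetical name and is not part of thai2nmt or fairseq, and the GPU count is passed in by hand (on a machine with PyTorch installed, it could come from `torch.cuda.device_count()`).

```python
def check_gpu_index(requested, gpu_count):
    """Return `requested` if it is a valid zero-based GPU index, else raise ValueError.

    Hypothetical helper for illustration; `gpu_count` is supplied by the caller.
    """
    if gpu_count < 1:
        raise ValueError("no GPU is visible on this machine")
    if not 0 <= requested < gpu_count:
        raise ValueError(
            f"GPU index {requested} is out of range; "
            f"valid indices are 0..{gpu_count - 1}"
        )
    return requested
```

On a single-GPU machine such as Colab, `check_gpu_index(0, 1)` passes, while `check_gpu_index(7, 1)` raises a `ValueError`, which mirrors the change from 7 to 0 above.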
Thank you so much @artificiala for your quick reply,
Training on Colab works now, but when I run the experiment on a GPU server, I still get the same error. Do you have any idea what could be causing it? Could it be fairseq, SentencePiece, or something else besides the GPU?
Below is the message I got when I ran `bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 0 ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ scb-mt-en-th-2020/th-en/spm-spm/32000-joined 9750 150` on a GPU server with 1 GPU.
2020-08-30 16:16:09 | INFO | fairseq_cli.train | model transformer, criterion LabelSmoothedCrossEntropyCriterion
2020-08-30 16:16:09 | INFO | fairseq_cli.train | num. model params: 74047488 (num. trained: 74047488)
2020-08-30 16:16:10 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
2020-08-30 16:16:10 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2020-08-30 16:16:10 | INFO | fairseq_cli.train | max tokens per GPU = 9750 and max sentences per GPU = None
2020-08-30 16:16:10 | INFO | fairseq.trainer | no existing checkpoint found ./checkpoints/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/checkpoint_last.pt
2020-08-30 16:16:10 | INFO | fairseq.trainer | loading train data for epoch 1
2020-08-30 16:16:10 | INFO | fairseq.data.data_utils | loaded 801402 examples from: ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/train.th-en.th
2020-08-30 16:16:10 | INFO | fairseq.data.data_utils | loaded 801402 examples from: ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/train.th-en.en
2020-08-30 16:16:10 | INFO | fairseq.tasks.translation | ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ train th-en 801402 examples
....
RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'
Hi @meanna,
Which GPU and CUDA version do you use?
You could run the following Python script on the server to output the information: `python collect_env.py`
https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py
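For a quicker check than the full script, a few lines of Python can report what PyTorch sees on the machine. This is only a sketch: `cuda_summary` is a hypothetical helper name, and the dictionary keys are made up for illustration. The calls it relies on (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`) are standard PyTorch attributes.

```python
def cuda_summary():
    """Return a small dict describing what PyTorch can see on this machine.

    Hypothetical helper; if torch is not installed, only that fact is reported.
    """
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA toolkit version the installed PyTorch wheel was built against
        "built_with_cuda": torch.version.cuda,
        # False means training would fall back to the CPU
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_summary().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` on a machine that has a GPU, the driver and CUDA versions are the first thing to compare.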
Hi @artificiala ,
Below is the output of the script.
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.17.3
Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1080
Nvidia driver version: 418.39
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.4
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[conda] numpy 1.19.1 pypi_0 pypi
[conda] torch 1.6.0 pypi_0 pypi
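I think it is due to an incompatibility between the CUDA version that PyTorch was built with (10.2) and the Nvidia driver version (418.39). According to the compatibility table, CUDA 10.2 needs driver 440.33 or newer on Linux, while driver 418.39 only supports up to CUDA 10.1 (refer to https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver). This matches the "Is CUDA available: False" line in your output.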
You may need to uninstall `torch` and reinstall it with the following version.
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
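The driver check can also be written down as a small comparison. The sketch below hard-codes a few rows of the minimum Linux driver versions from the Nvidia compatibility table linked above; the names `MIN_LINUX_DRIVER` and `driver_supports` are hypothetical, and the table is deliberately incomplete.

```python
# Minimum Linux driver version per CUDA toolkit, copied from Nvidia's
# compatibility table. Only a few rows are listed; extend as needed.
MIN_LINUX_DRIVER = {
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def parse_version(text):
    """Turn a dotted version string such as '418.39' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(cuda_version, driver_version):
    """Return True if the driver meets the minimum for the given CUDA toolkit.

    Raises KeyError for CUDA versions missing from MIN_LINUX_DRIVER.
    """
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_version]
```

With the values from the output above, `driver_supports("10.2", "418.39")` is `False` and `driver_supports("10.1", "418.39")` is `True`, which is why the `+cu101` build is the one to install on this server.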
Seems to work now! :) Thank you so much @artificiala ♥♥♥♥
Hello, I'm trying to replicate the experiment with th-en (spm-spm) according to https://github.com/vistec-AI/thai2nmt/blob/master/experiments/TBASE.SCB-1M.md. When I executed the training command `bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 3 ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ scb-mt-en-th-2020/th-en/spm-spm/32000-joined 9750 150`, I got the following error:
... RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'
And when I ran the training command for newmm→moses (th-en), I got the following error message:
....
2020-08-29 15:44:22 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 347, in cli_main
    cli_main_helper(args)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 385, in cli_main_helper
    main(args)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 216, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq/trainer.py", line 494, in train_step
    self.optimizer.multiply_grads(self.data_parallel_world_size / sample_size)
ZeroDivisionError: float division by zero
Do you know what I can do about this, or what could be causing these errors? I'm using Google Colab to run the experiments. This is part of my university project, which I have to submit soon. I would be very grateful if you could help me out with this. Thanks a lot in advance.
Suteera