vistec-AI / thai2nmt

English-Thai Machine Translation Models
https://airesearch.in.th/releases/machine-translation-models/
Apache License 2.0

Error in training th-en(spm-spm) #1

Closed meanna closed 3 years ago

meanna commented 4 years ago

Hello, I'm trying to replicate the th-en (spm-spm) experiment according to https://github.com/vistec-AI/thai2nmt/blob/master/experiments/TBASE.SCB-1M.md. When I executed the training command bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 3 ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ scb-mt-en-th-2020/th-en/spm-spm/32000-joined 9750 150, I got the following error:

... RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'

And when I ran the training command for newmm→moses (th-en), I got the following error message:

....
2020-08-29 15:44:22 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 347, in cli_main
    cli_main_helper(args)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 385, in cli_main_helper
    main(args)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq_cli/train.py", line 216, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/content/drive/My Drive/Colab Notebooks/Thai_MT/thai2nmt/fairseq/fairseq/trainer.py", line 494, in train_step
    self.optimizer.multiply_grads(self.data_parallel_world_size / sample_size)
ZeroDivisionError: float division by zero

Do you know what I can do about it, or what could be the cause of these errors? I'm running the experiments on Google Colab. This is part of a university project that I have to submit soon, so I would be very grateful if you could help me out. Thanks a lot in advance.

Suteera

meanna commented 4 years ago

Please take a look at my Colab notebook: https://colab.research.google.com/drive/14T3BGfHOReG6GXtD7xFqoidVZPQW2iIK?usp=sharing

lalital commented 4 years ago

Hello @meanna,

Thank you for your interest in our English-Thai Machine Translation dataset.

Regarding the first issue, __RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'__, this is caused by the first argument of the fairseq training script ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh. The first argument selects the GPU to use for model training (e.g. 0 for the first GPU, 1 for the second). On a machine with a single GPU (e.g. Google Colab), the issue can be resolved by changing the first argument from 7 to 0, so that fairseq uses the machine's first GPU to train the model.

from

!bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 7 ./dataset/binarized/scb-mt-en-th-2020/en-th/spm-spm/32000-joined/ scb-mt-en-th-2020/en-th/spm-spm/32000-joined 9750 150

to

!bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 0 ./dataset/binarized/scb-mt-en-th-2020/en-th/spm-spm/32000-joined/ scb-mt-en-th-2020/en-th/spm-spm/32000-joined 9750 150
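
As a quick way to confirm this on your side, here is a minimal sketch of mine (assuming PyTorch 1.6; it is not part of the repository's scripts). When no CUDA device is visible, fairseq falls back to CPU, and fp16 (Half) kernels like the bernoulli used by dropout are not implemented there:

```python
import torch

# If the GPU index passed to the script does not exist on the machine,
# no CUDA device is visible and training silently runs on the CPU.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())

# fp16 (Half) random kernels are not implemented on CPU in PyTorch 1.6.
x = torch.randn(8, 8).half()  # Half tensor on the CPU
try:
    x.bernoulli_(0.1)  # the kind of op dropout relies on internally
except RuntimeError as err:
    # Typically prints: "bernoulli_scalar_cpu_" not implemented for 'Half'
    print("CPU fp16 bernoulli failed:", err)
```

If "CUDA available" prints False inside the training environment, the script is effectively training in fp16 on the CPU, which reproduces exactly this RuntimeError.
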
meanna commented 4 years ago

Thank you so much @artificiala for your quick reply,

Training on Colab works now, but when I run the experiment on a GPU server, I still get the same error. Do you have any idea what could be causing it? Could it be fairseq, SentencePiece, or something else besides the GPU?

Below is the message I got when I ran bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 0 ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ scb-mt-en-th-2020/th-en/spm-spm/32000-joined 9750 150 on a GPU server with one GPU.

2020-08-30 16:16:09 | INFO | fairseq_cli.train | model transformer, criterion LabelSmoothedCrossEntropyCriterion
2020-08-30 16:16:09 | INFO | fairseq_cli.train | num. model params: 74047488 (num. trained: 74047488)
2020-08-30 16:16:10 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
2020-08-30 16:16:10 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2020-08-30 16:16:10 | INFO | fairseq_cli.train | max tokens per GPU = 9750 and max sentences per GPU = None
2020-08-30 16:16:10 | INFO | fairseq.trainer | no existing checkpoint found ./checkpoints/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/checkpoint_last.pt
2020-08-30 16:16:10 | INFO | fairseq.trainer | loading train data for epoch 1
2020-08-30 16:16:10 | INFO | fairseq.data.data_utils | loaded 801402 examples from: ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/train.th-en.th
2020-08-30 16:16:10 | INFO | fairseq.data.data_utils | loaded 801402 examples from: ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/train.th-en.en
2020-08-30 16:16:10 | INFO | fairseq.tasks.translation | ./dataset/binarized/scb-mt-en-th-2020/th-en/spm-spm/32000-joined/ train th-en 801402 examples
....
RuntimeError: "bernoulli_scalar_cpu_" not implemented for 'Half'

lalital commented 4 years ago

Hi @meanna,

Which GPU and CUDA version do you use?

You could run the following Python script on the server to print that information: python collect_env.py

https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py
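
If it is more convenient, here is a small sketch (my own addition, assuming the server has internet access) that downloads and runs that script in one step:

```python
# Download PyTorch's collect_env.py and run it as a script; this is
# equivalent to downloading it manually and running `python collect_env.py`.
import urllib.request
import runpy

URL = ("https://raw.githubusercontent.com/pytorch/pytorch/"
       "master/torch/utils/collect_env.py")
urllib.request.urlretrieve(URL, "collect_env.py")
runpy.run_path("collect_env.py", run_name="__main__")  # prints the environment report
```
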

meanna commented 4 years ago

Hi @artificiala ,

Below is the output of the script.

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.17.3
Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1080
Nvidia driver version: 418.39
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.4
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[conda] numpy 1.19.1 pypi_0 pypi
[conda] torch 1.6.0 pypi_0 pypi

lalital commented 4 years ago

I think it is due to an incompatibility between the CUDA version PyTorch was built with (10.2) and Nvidia driver version 418.39, which supports CUDA 10.1 but not 10.2 (refer to https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver). That is also why collect_env.py reports "Is CUDA available: False".

You may need to uninstall torch and reinstall it with the following version, which is built against CUDA 10.1.

pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
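
After reinstalling, a quick check along these lines (a sketch of mine, not part of the repository) should report CUDA 10.1 and an available GPU:

```python
import torch

# Expected after installing the +cu101 build on a machine with driver 418.39:
# the build's CUDA version reads 10.1 and the GTX 1080 becomes visible.
print("torch:", torch.__version__)             # e.g. 1.6.0+cu101
print("built with CUDA:", torch.version.cuda)  # e.g. 10.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```
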
meanna commented 4 years ago

Seems to work now! :) Thank you so much @artificiala ♥♥♥♥