pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

fairseq wav2vec in TPU, but no gradient backward? #2681

Closed xwuShirley closed 3 years ago

xwuShirley commented 3 years ago

🐛 Bug

I tried to run the fairseq wav2vec example on a v3-8 TPU, following https://github.com/pytorch/fairseq/tree/master/examples/wav2vec

But I got the following error during gradient backward:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [XLAFloatType [216, 1024, 1, 48]], which is output 0 of UnsqueezeBackward0, is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

2020-12-11 02:31:38 | INFO | fairseq_cli.train | task: AudioPretrainingTask
2020-12-11 02:31:38 | INFO | fairseq_cli.train | model: Wav2VecCtc
2020-12-11 02:31:38 | INFO | fairseq_cli.train | criterion: CtcCriterion
2020-12-11 02:31:38 | INFO | fairseq_cli.train | num. model params: 315471520 (num. trained: 315471520)
2020-12-11 02:31:43 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-12-11 02:31:43 | INFO | fairseq_cli.train | max tokens per GPU = 3400000 and batch size per GPU = None
2020-12-11 02:31:43 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-12-11 02:31:43 | INFO | fairseq.trainer | loading train data for epoch 1
2020-12-11 02:31:43 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 33139, skipped 0 samples
2020-12-11 02:31:44 | INFO | fairseq.trainer | begin training epoch 1

Exception in device=TPU:5: one of the variables needed for gradient computation has been modified by an inplace operation: [XLAFloatType [216, 1024, 1, 48]], which is output 0 of UnsqueezeBackward0, is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/home/user/fairseq/fairseq/distributed_utils.py", line 302, in distributed_main
    main(cfg, **kwargs)
  File "/home/user/fairseq/fairseq_cli/train.py", line 138, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/user/fairseq/fairseq_cli/train.py", line 227, in train
    log_output = trainer.train_step(samples)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/user/fairseq/fairseq/trainer.py", line 562, in train_step
    raise e
  File "/home/user/fairseq/fairseq/trainer.py", line 536, in train_step
    ignore_grad=is_dummy_batch,
  File "/home/user/fairseq/fairseq/tasks/fairseq_task.py", line 432, in train_step
    optimizer.backward(loss)
  File "/home/user/fairseq/fairseq/optim/fairseq_optimizer.py", line 95, in backward
    loss.backward()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 233, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 146, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [XLAFloatType [216, 1024, 1, 48]], which is output 0 of UnsqueezeBackward0, is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Traceback (most recent call last):
  File "/home/user/fairseq/fairseq_cli/hydra_train.py", line 38, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/user/fairseq/fairseq/distributed_utils.py", line 332, in call_main
    nprocs=8,  # use all 8 TPU cores
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with exit code 17

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
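
As the error suggests, anomaly detection can be turned on to find the offending in-place op. Below is only a rough sketch of how to apply the hint: the tiny model is a stand-in for the fairseq trainer, and in the real run the flag would be set near the top of fairseq_cli/train.py or the hydra entry point.

import torch
import torch_xla.core.xla_model as xm

# Make the backward pass report which forward op produced the tensor
# that was later modified in place (at the cost of slower execution).
torch.autograd.set_detect_anomaly(True)

device = xm.xla_device()
model = torch.nn.Linear(16, 4).to(device)   # stand-in for the real Wav2VecCtc model
x = torch.randn(8, 16, device=device)
loss = model(x).sum()
loss.backward()   # with anomaly mode on, an in-place error names the culprit forward op
xm.mark_step()    # flush the lazily built XLA graph to the TPU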

To Reproduce

docker run -it -v /home:/home --ipc=host  gcr.io/tpu-pytorch/xla:nightly_3.6
pip install soundfile                                                
pip install editdistance
sudo apt-get install libsndfile1

export TPU_IP_ADDRESS=XXX
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
git clone https://github.com/pytorch/fairseq.git
cd  fairseq
pip install --editable ./

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt
data_dir=$HOME/samples
model=$HOME/wav2vec_vox_new.pt
python $HOME/fairseq/fairseq_cli/hydra_train.py \
    task.data=${data_dir}  \
    model.w2v_path=${model}  \
    --config-dir $HOME/fairseq-tpu \
    --config-name my_base

You need two additional files to reproduce the above: the data and the configuration yaml. I put them at this link: https://gitlab.com/xwuShirley/fairseq-tpu/-/blob/master

==> the data is sample.zip

After unzipping it, please edit train.tsv and valid.tsv to update the data directory (the first line).

==> the yaml is my_base.yaml (originally from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/config/finetuning/base_100h.yaml; I updated it as in https://gitlab.com/xwuShirley/fairseq-tpu/-/blob/master/my_base.yaml#L7)
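
As a sanity check that the container can actually see the TPU via the XRT_TPU_CONFIG export above, a short snippet like the following can be run before the fairseq command (this is only a rough check, not part of the fairseq code):

import torch
import torch_xla.core.xla_model as xm

# Should list the TPU cores made visible by XRT_TPU_CONFIG (eight on a v3-8).
print(xm.get_xla_supported_devices())

device = xm.xla_device()
t = torch.ones(2, 2, device=device) * 3
xm.mark_step()    # materialize the lazy computation on the TPU
print(t.cpu())    # expect a 2x2 tensor of threes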

Thank you very much in advance for your help. I was told a v3-8 TPU runs much faster than 8x V100 GPUs, so I decided to give it a try. I am not sure whether this is due to the CTC criterion (https://gitlab.com/xwuShirley/fairseq-tpu/-/blob/master/my_base.yaml#L30), since it is not a standard loss.

Best, Shirley

ultrons commented 3 years ago

@xwuShirley, are you training a wav2vec, vq-wav2vec, or wav2vec2 model? If it is one of the first two, can you try this fork?

ultrons commented 3 years ago

If you are using wav2vec2, we can point you to a different fork that is in progress.

xwuShirley commented 3 years ago

We are training wav2vec2, but we have decided to use A100s instead. Thanks!

awasthiabhijeet commented 3 years ago

Hi @ultrons, Could you please point me to a wav2vec2 fork that works on TPUs?

Thanks :)

ultrons commented 3 years ago

@awasthiabhijeet, the master branch works; it has instructions in the examples/wav2vec README. Use a config file something like this:

export XRT_TPU_CONFIG="localservice;0;localhost:51011"
OMP_NUM_THREADS=1 fairseq-hydra-train   task.data=/home/sivaibhav/manifest   --config-dir ./examples/wav2vec/config/pretraining   --config-name wav2vec2_large_librivox_tpu.yaml

With one modification: add batch_size: 4 in the dataset section. Let me know if you have any issues.

awasthiabhijeet commented 3 years ago

Hi @ultrons ,

Does it also support supervised fine-tuning using the CTC-loss?

README in examples/wav2vec mentions that "Wav2Vec2 is now supported on TPUs! It's currently pre-training only."

I am looking for code that lets me fine-tune wav2vec2 on TPUs using CTC loss. The CTC loss provided by PyTorch is currently not lowered in pytorch/xla (https://github.com/pytorch/xla/issues/2399).
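
For reference, one way to check whether an op such as CTC loss is lowered for XLA or falls back to CPU is torch_xla's debug metrics: any aten:: counter in the report is an op that ran outside the TPU. A rough sketch (since CTCLoss is not lowered, this may either error out or show an aten:: fallback counter):

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
log_probs = torch.randn(50, 4, 20, device=device).log_softmax(-1)          # (T, N, C)
targets = torch.randint(1, 20, (4, 10), dtype=torch.long, device=device)   # avoid blank=0
input_lengths = torch.full((4,), 50, dtype=torch.long, device=device)
target_lengths = torch.full((4,), 10, dtype=torch.long, device=device)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
xm.mark_step()

# Counters named "aten::..." indicate ops that were not lowered and fell back to CPU.
print(met.counter_names())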

ultrons commented 3 years ago

Currently, W2V2 on TPU is only used for pre-training. The CTC loss is in the fine-tuning code, which has not been optimized for TPUs yet.