microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

MT baseline training error #42

Closed: vibhavagarwal5 closed this issue 3 years ago

vibhavagarwal5 commented 3 years ago

I'm trying to run `bash train.sh facebook/mbart-large-cc25 mbart MT_EN_HI` but I'm getting the following error:


Fine-tuning facebook/mbart-large-cc25 on MT_EN_HI
03/22/2021 19:49:10 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
03/22/2021 19:49:10 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/tmp/mt_model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Mar22_19-49-10_iiitb-ThinkStation-P920', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=6000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/mt_model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)
03/22/2021 19:49:11 - WARNING - datasets.builder -   Using custom data configuration default-48b337e1e0f30e1a
03/22/2021 19:49:11 - WARNING - datasets.builder -   Reusing dataset csv (/home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "MBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1024,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "task_specific_params": {
    "translation_en_to_ro": {
      "decoder_start_token_id": 250020
    }
  },
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 250027
}

loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "MBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1024,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "task_specific_params": {
    "translation_en_to_ro": {
      "decoder_start_token_id": 250020
    }
  },
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 250027
}

loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/sentencepiece.bpe.model from cache at /home/vibhav/.cache/huggingface/transformers/83d419fb34e90155a8d95f7799f7a7316a327dc28c7ee6bee15b5a62d3c5ca6b.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer.json from cache at /home/vibhav/.cache/huggingface/transformers/16e85cac0e7a8c2938ac468199d0adff7483341305c7e848063b72dcf5f22538.39607a8bede9bcd2666ea442230a9d382f57e4fea127c9cc5b6fc6caf527d682
loading weights file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/pytorch_model.bin from cache at /home/vibhav/.cache/huggingface/transformers/58963b41815ac5618d9910411e018d60a3ae7d4540a66e6cf70adf29a748ca1b.bef0d2e3352d6c4bf1213c6207738ec5ecf458de355c65b2aead6671bc612138
All model checkpoint weights were used when initializing MBartForConditionalGeneration.

All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-cc25.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-a5481e6d57bfbce0.arrow
03/22/2021 19:50:10 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-41fcbb9a83397c4b.arrow
Using amp fp16 backend
***** Running training *****
  Num examples = 8060
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 10075
  0%|                                                                                                                                           | 0/10075 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 584, in <module>
    main()
  File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 529, in main
    train_result = trainer.train()
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1053, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1441, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1475, in compute_loss
    outputs = model(**inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1303, in forward
    return_dict=return_dict,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1166, in forward
    return_dict=return_dict,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 803, in forward
    output_attentions=output_attentions,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 317, in forward
    output_attentions=output_attentions,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 181, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
  0%|                                                                                                                                           | 0/10075 [00:00<?, ?it/s]
vibhavagarwal5 commented 3 years ago

Btw, this runs perfectly on CPU; only on GPU do I get this error.
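For reference, a minimal sketch like the following (not part of the GLUECoS scripts) can confirm whether the CUBLAS failure is reproducible outside the trainer; the traceback above dies in `F.linear` on the GPU, so a tiny forward pass through one linear layer is usually enough:

```python
# Minimal GPU sanity check (a sketch, not part of run_seq2seq.py):
# if the torch/CUDA install is at fault, this raises the same
# CUBLAS_STATUS_INTERNAL_ERROR without any of the MT code involved.
import torch

layer = torch.nn.Linear(16, 16).cuda()
x = torch.randn(2, 16, device="cuda")
print(layer(x).sum().item())
```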

vibhavagarwal5 commented 3 years ago

I fixed this by downgrading PyTorch to 1.6, but now I'm not able to fit the model even with a batch size of 1 on my 2080 Ti GPU (12GB). @Genius1237?

Genius1237 commented 3 years ago

The first issue you've encountered is likely due to a mismatch between your CUDA version and the version of PyTorch you've installed. Please try upgrading to CUDA 11 and PyTorch 1.7+ and see if the error goes away. The code was tested on PyTorch 1.7.
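A quick way to check for such a mismatch (a sketch; run it inside the gluecos environment) is to print the versions PyTorch itself reports:

```python
# Report the CUDA/cuDNN build the installed PyTorch wheel ships with
# and whether it can actually see the GPU.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```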

I believe a batch size of 4 worked on a 16 GB Tesla P100. Could you check whether the fp16 flag is set? A batch size of 2 should definitely work on your 2080 Ti. Try upgrading to CUDA 11 as well, as it brings a lot of improvements.
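These options correspond to the Seq2SeqTrainingArguments printed in the logs above. A sketch of the relevant fields follows; the actual values are set by train.sh / run_seq2seq.py, so treat the numbers here as illustrative only:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative values only; train.sh passes its own settings to run_seq2seq.py.
args = Seq2SeqTrainingArguments(
    output_dir="/tmp/mt_model",
    fp16=True,                      # the fp16 flag being asked about
    per_device_train_batch_size=2,  # drop from 4 to 2 (or 1) if VRAM is tight
    gradient_accumulation_steps=2,  # keeps the effective batch size at 4
    predict_with_generate=True,
)
print(args.fp16, args.per_device_train_batch_size, args.gradient_accumulation_steps)
```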

vibhavagarwal5 commented 3 years ago

The fp16 flag is set, my CUDA version is 10.2, and this works fine with PyTorch 1.7 as well (but not with 1.8). I'm just wary of upgrading CUDA because it sometimes breaks things. Is there anything else I should check?

Genius1237 commented 3 years ago

Could you please share the stdout/logs? I really cannot think of anything else that could be the issue.

How did you install PyTorch? If you know the exact command, please share that; if not, at least tell me where it came from (the anaconda/PyPI repos or the PyTorch website).

vibhavagarwal5 commented 3 years ago

PyTorch was installed with `conda install pytorch==1.6.0 cudatoolkit=10.2 -c pytorch`.

Fine-tuning facebook/mbart-large-cc25 on MT_EN_HI
03/25/2021 12:58:33 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
03/25/2021 12:58:33 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/tmp/mt_model', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Mar25_12-58-33_iiitb-ThinkStation-P920', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=6000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/mt_model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)
03/25/2021 12:58:34 - WARNING - datasets.builder -   Using custom data configuration default-48b337e1e0f30e1a
03/25/2021 12:58:34 - WARNING - datasets.builder -   Reusing dataset csv (/home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "MBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1024,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "task_specific_params": {
    "translation_en_to_ro": {
      "decoder_start_token_id": 250020
    }
  },
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 250027
}

loading configuration file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json from cache at /home/vibhav/.cache/huggingface/transformers/36135304685d914515720daa48fc1adae57803e32ab82d5bde85ef78479e9765.b548f7e307531070391a881374674824b374f829e5d8f68857012de63fe2681a
Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "MBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1024,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "task_specific_params": {
    "translation_en_to_ro": {
      "decoder_start_token_id": 250020
    }
  },
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 250027
}

loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/sentencepiece.bpe.model from cache at /home/vibhav/.cache/huggingface/transformers/83d419fb34e90155a8d95f7799f7a7316a327dc28c7ee6bee15b5a62d3c5ca6b.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/tokenizer.json from cache at /home/vibhav/.cache/huggingface/transformers/16e85cac0e7a8c2938ac468199d0adff7483341305c7e848063b72dcf5f22538.39607a8bede9bcd2666ea442230a9d382f57e4fea127c9cc5b6fc6caf527d682
loading weights file https://huggingface.co/facebook/mbart-large-cc25/resolve/main/pytorch_model.bin from cache at /home/vibhav/.cache/huggingface/transformers/58963b41815ac5618d9910411e018d60a3ae7d4540a66e6cf70adf29a748ca1b.bef0d2e3352d6c4bf1213c6207738ec5ecf458de355c65b2aead6671bc612138
All model checkpoint weights were used when initializing MBartForConditionalGeneration.

All the weights of MBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-cc25.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MBartForConditionalGeneration for predictions without further training.
03/25/2021 12:59:25 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-a5481e6d57bfbce0.arrow
03/25/2021 12:59:25 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/vibhav/.cache/huggingface/datasets/csv/default-48b337e1e0f30e1a/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0/cache-41fcbb9a83397c4b.arrow
Using amp fp16 backend
***** Running training *****
  Num examples = 8060
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 40300
  0%|                                                                                                                                           | 0/40300 [00:00<?, ?it/s]/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
  0%|                                                                                                                                 | 6/40300 [00:01<2:40:49,  4.18it/s]Traceback (most recent call last):
  File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 584, in <module>
    main()
  File "/home/hdd1/vibhav/Thesis/GLUECoS/Code/run_seq2seq.py", line 529, in main
    train_result = trainer.train()
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1053, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1441, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/trainer.py", line 1475, in compute_loss
    outputs = model(**inputs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1303, in forward
    return_dict=return_dict,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1189, in forward
    return_dict=return_dict,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 1057, in forward
    use_cache=use_cache,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 412, in forward
    output_attentions=output_attentions,
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hdd1/vibhav/anaconda3/envs/gluecos/lib/python3.7/site-packages/transformers/models/mbart/modeling_mbart.py", line 268, in forward
    .reshape(bsz, tgt_len, embed_dim)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.76 GiB total capacity; 9.86 GiB already allocated; 3.94 MiB free; 9.88 GiB reserved in total by PyTorch)
  0%|                                                                                                                                 | 6/40300 [00:01<2:45:28,  4.06it/s]
Genius1237 commented 3 years ago

I've never used the CUDA/cuDNN builds that come with conda; please check that they aren't outdated. Also, given that you're installing them inside a conda env, there should be no harm in trying out a newer version of CUDA in a separate environment.

At this point, I really don't have any suggestion other than updating CUDA and PyTorch to the latest versions and trying a smaller batch size. Let me know how this goes. If things still don't work out, I can look into reducing the memory footprint in some way.
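One option in that direction, offered only as a sketch (not something run_seq2seq.py does today), is gradient checkpointing; the MBart config printed above exposes a gradient_checkpointing flag that is currently false:

```python
from transformers import MBartForConditionalGeneration

# Trade extra compute for a much smaller activation-memory footprint.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
model.config.gradient_checkpointing = True
model.config.use_cache = False  # the generation cache is not useful with checkpointing
```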

Genius1237 commented 3 years ago

Are you using the 2080 Ti for video output as well? The log says the total capacity is only 10.76 GiB, and Xorg and the desktop environment take up some VRAM in that case. You can claw back a bit more VRAM if you can run the system headless.
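A rough way to see how much of the card the training process itself is holding (a sketch; whatever else nvidia-smi reports is what Xorg and the desktop environment are taking):

```python
import torch

# Compare the card's total memory with what this PyTorch process has grabbed.
props = torch.cuda.get_device_properties(0)
print(f"total:     {props.total_memory / 2**30:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")
```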

vibhavagarwal5 commented 3 years ago

True, that system has video outputs as well. I'll have a look at your suggestions and let you know.

vibhavagarwal5 commented 3 years ago

Hey @Genius1237, just to update you: after the CUDA upgrade I'm getting the same out-of-memory error. Also, the seq2seq script doesn't seem to work out of the box; I had to change AutoTokenizer to MBartTokenizer to get it to run. Even a batch size of 1 doesn't fit on the 2080 Ti (12gb). :/

Genius1237 commented 3 years ago

I don't have a lot of free time at hand right now, so I'm not able to look into this currently. I know that at least one other person has been able to get this running, so I'd suggest that you try looking into it yourself or try running it on a different system.

Genius1237 commented 3 years ago

As for the AutoTokenizer/MBartTokenizer issue, please check that the model name (facebook/mbart-large-cc25) is being passed through correctly; AutoTokenizer uses it to pick the correct tokenizer class. Otherwise, it could be an issue with the version of the transformers library.
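A quick check, sketched here with the model name hard-coded, is to see which class AutoTokenizer actually resolves to for that checkpoint:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")
print(type(tok).__name__)  # expected: MBartTokenizer or MBartTokenizerFast
```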

vibhavagarwal5 commented 3 years ago

I understand, @Genius1237. I was only able to run it on a GPU with 16 GB of VRAM, and it filled up 15 GB with just a batch size of 4.