microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutLM - Cuda Device Ordinal error #511

Open gregg-ADP opened 2 years ago

gregg-ADP commented 2 years ago

Describe the bug: When we run the file unilm/layoutlmft/examples/run_xfun_re.py, we get the error RuntimeError: CUDA error: invalid device ordinal, raised from torch._C._cuda_setDevice(device).

Model I am using: LayoutLM

The problem arises when using: the official example script unilm/layoutlmft/examples/run_xfun_re.py, without changes and exactly as described in the instructions.

Software Versions: Python 3.7.10, CUDA 10.2, PyTorch 1.8.0, TorchVision 0.9.0

python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir /tmp/test-ner \
--do_train \
--do_eval \
--lang zh \
--max_steps 2500 \
--per_device_train_batch_size 2 \
--warmup_ratio 0.1 \
--fp16

To Reproduce: Steps to reproduce the behavior:

  1. Run the program using the same command documented here: https://github.com/microsoft/unilm/tree/master/layoutxlm#fine-tuning-for-relation-extraction
  2. Execution starts and terminates quickly with the error RuntimeError: CUDA error: invalid device ordinal, raised from torch._C._cuda_setDevice(device).

Expected behavior: We expect the training to run to completion, since we are running the example code without any changes and with the same command.

Stack Trace:

WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
INFO:__main__:Training/evaluation parameters TrainingArguments(output_dir=/tmp/test-ner, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=2500, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.1, warmup_steps=0, logging_dir=runs/Nov09_21-28-25_ip-10-205-48-185, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/tmp/test-ner, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, mp_parameters=)
Traceback (most recent call last):
  File "examples/run_xfun_re.py", line 246, in <module>
    main()
  File "examples/run_xfun_re.py", line 50, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/training_args.py", line 570, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/training_args.py", line 717, in device
    return self._setup_devices
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/file_utils.py", line 1460, in __get__
    cached = self.fget(obj)
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/transformers/training_args.py", line 707, in _setup_devices
    torch.cuda.set_device(device)
  File "/home/ubuntu/AegisQATrainer/aegis-layoutlmft/lib/python3.7/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

FYI @NielsRogge

gregg-ADP commented 2 years ago

Correction: We realized that we had to change the requirements.txt file to the following:

# -f https://download.pytorch.org/whl/torch_stable.html
# -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html

datasets==1.6.2
torch==1.8
torchvision==0.9.0
transformers==4.5.1
# detectron2==0.3
seqeval==1.2.2

We then installed detectron2 0.6 using this command: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

We had to change the torch version because detectron2 0.6 required it. We also had to use version 0.6 because 0.3 does not install.

The problems we had while trying to install detectron2 0.3 were the following. (This is why we are using 0.6 above.) We get this when trying to install with torch 1.7.1 and torchvision 0.8.2:

python -m pip install detectron2==0.3 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
Looking in links: https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
ERROR: Could not find a version that satisfies the requirement detectron2==0.3 (from versions: none)
ERROR: No matching distribution found for detectron2==0.3

We also tried this, without pinning the detectron2 version, with torch 1.7.1 and torchvision 0.8.2:

python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
Looking in links: https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
ERROR: Could not find a version that satisfies the requirement detectron2 (from versions: none)
ERROR: No matching distribution found for detectron2
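The index URL in the commands above encodes a CUDA tag (cu101) and a PyTorch minor version (torch1.7), while the environment reported earlier is CUDA 10.2. The following is a minimal sketch, assuming the URL pattern seen above carries over to other version pairs, that builds the index URL for a given PyTorch and CUDA version so the two can be compared. The helper name detectron2_index_url is hypothetical, and the sketch does not check whether wheels are actually published for a given combination or whether this mismatch is what caused the pip error.

```python
def detectron2_index_url(torch_version: str, cuda_version: str) -> str:
    """Build a wheel index URL following the pattern used in this thread.

    Assumption: the pattern .../wheels/cu<CUDA>/torch<MAJOR.MINOR>/index.html
    generalizes; this function only formats a string and checks nothing online.
    """
    # Drop any local build suffix such as "+cu102", then keep major.minor only.
    major, minor = torch_version.split("+")[0].split(".")[:2]
    # "10.2" becomes "cu102", "10.1" becomes "cu101".
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{cuda_tag}/torch{major}.{minor}/index.html"
    )


# URL used in the commands above (torch 1.7.1, CUDA 10.1 wheels).
print(detectron2_index_url("1.7.1", "10.1"))
# URL that would correspond to the reported environment (torch 1.8.0, CUDA 10.2).
print(detectron2_index_url("1.8.0", "10.2"))
```

With PyTorch installed, the arguments can be taken from torch.__version__ and torch.version.cuda.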

Also, we found the actual GPU type: NVIDIA K80. This is a p2.xlarge instance type.

jamcdon4 commented 2 years ago

The root cause here is not versioning. See the _setup_devices function in the source code: https://huggingface.co/transformers/v3.3.1/_modules/transformers/training_args.html

Fix: If you are working with one GPU, run this before running the example script: !export CUDA_VISIBLE_DEVICES=0

This should fix the issue.
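The traceback ends in torch.cuda.set_device(device) inside _setup_devices, which fails when a process asks for a GPU index that does not exist. The following is a minimal sketch, assuming each launched process uses its local rank as the GPU index, that fails early with a readable message. The helper names visible_gpu_count and check_local_rank are hypothetical and are not part of transformers or the example script.

```python
def visible_gpu_count() -> int:
    """Return how many CUDA devices this process can see (0 if PyTorch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    # device_count() respects CUDA_VISIBLE_DEVICES and returns 0 without CUDA.
    return torch.cuda.device_count()


def check_local_rank(local_rank: int, gpu_count: int) -> None:
    """Raise a readable error instead of 'CUDA error: invalid device ordinal'."""
    if local_rank >= gpu_count:
        raise ValueError(
            f"local_rank={local_rank} but only {gpu_count} GPU(s) are visible; "
            f"launch at most {gpu_count} process(es) per node"
        )


# Example: on a machine with one visible GPU, rank 0 is valid and rank 1 is not.
check_local_rank(0, 1)
try:
    check_local_rank(1, 1)
except ValueError as err:
    print(err)
```

If this reading is right, a launch with --nproc_per_node=4 on a machine with a single visible GPU would start ranks 1 to 3 with no matching device, so lowering --nproc_per_node to the value returned by visible_gpu_count() is worth trying. This is an inference from the command and traceback, not something the thread confirms.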

SaadAhmad376 commented 1 year ago

@jamcdon4 that does not work for me. I have tried this before but still get the same error.