Hi,
I've answered this question here.
TL;DR: you need to first call model.layoutlmv2.visual.synchronize_batch_norm().
Hi @NielsRogge, thanks for your quick response. I looked at that repo as well just a couple of minutes ago. The problem I face with that solution is that it gives this error:
raise RuntimeError("Make sure torch.distributed is set up properly.")
RuntimeError: Make sure torch.distributed is set up properly.
I read the above-linked post. The OP there also faces the same problem and you recommend the following:
You probably first need to call torch.distributed.init_process_group() before starting training.
Using this in the code forces me to implement DistributedDataParallel instead of the conventional DataParallel. Can you suggest something to help further?
It requires setting up the backend, rank, and world_size for DistributedDataParallel. Is this the way to go? Can you give an example of a working script that handles batch-norm synchronization without forcing the switch away from DataParallel?
Currently, I have added the following lines of code in my script:
import os
import torch

os.environ['MASTER_ADDR'] = 'localhost'  # rendezvous address/port for torch.distributed
os.environ['MASTER_PORT'] = '12355'
torch.distributed.init_process_group("nccl", rank=0, world_size=2)  # blocks until world_size processes have joined
model = LayoutLMv2_Classification_model().to(device)
model.LayoutLMv2Encoder.visual.synchronize_batch_norm()
The terminal hangs and there is no output displayed.
Any help on this case will be highly appreciated!! Thanks once again!
Are you running all of this in a notebook or as a script? The authors defined everything in a Python script, which they then launch as follows:
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_funsd.py \
--model_name_or_path microsoft/layoutlmv2-base-uncased \
--output_dir /tmp/test-ner \
--do_train \
--do_predict \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16
That's the recommended way to train deep learning models with PyTorch on multiple GPUs. torch.distributed.launch
is a helper utility that can be used to launch multiple processes per node for distributed training.
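For context, the script that gets launched this way is expected to do the per-process distributed setup itself; torch.distributed.init_process_group() blocks until world_size processes have joined, which is why calling it from a single, manually started process (as in the snippet earlier in this thread) appears to hang. Below is a minimal sketch of that per-process setup, assuming torchrun-style environment variables and a hypothetical build_model() helper; it is not the authors' run_funsd.py:

import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher starts one process per GPU and sets LOCAL_RANK / RANK / WORLD_SIZE for each.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl")

    model = build_model().cuda(local_rank)             # hypothetical: builds the LayoutLMv2-based model
    model.layoutlmv2.visual.synchronize_batch_norm()   # sync the visual backbone's batch norm across ranks
    model = DDP(model, device_ids=[local_rank])
    # ... regular training loop over the rank-local shard of the data ...

if __name__ == "__main__":
    main()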
It would be great if we could add an example script for LayoutLMv2/LayoutXLM to the examples folder of HuggingFace Transformers. It would mean updating the Python script so it works with HuggingFace Transformers instead of the original unilm repository.
Are you interested in contributing this?
Actually, let me mark it as a "good first issue" (this is a good first contribution for people interested in contributing). This way, we can help others fine-tune LayoutLMv2 on multiple GPUs.
Shall I take this up?
@harsha070 would be great! So the goal would be to add an example script that could be called run_layoutlmv2.py
that uses the HuggingFace Trainer to fine-tune the model on the FUNSD dataset. You can also create a run_layoutlmv2_no_trainer.py
script that leverages HuggingFace Accelerate instead to run on multiple GPUs.
Do you have a setup with more than 1 GPU?
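For reference, a no-trainer variant on multiple GPUs typically follows the Accelerate pattern sketched below. This is a generic outline under stated assumptions (FUNSD's 7 token labels, a train_dataloader built elsewhere that yields batches already encoded by LayoutLMv2Processor), not the eventual run_layoutlmv2_no_trainer.py:

from accelerate import Accelerator
from torch.optim import AdamW
from transformers import LayoutLMv2ForTokenClassification

accelerator = Accelerator()  # picks up the distributed config from `accelerate launch` / torchrun

model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7
)
optimizer = AdamW(model.parameters(), lr=5e-5)

# train_dataloader is assumed to be built elsewhere from the processed FUNSD dataset
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward() so gradients sync across processes
    optimizer.step()
    optimizer.zero_grad()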
Sure. Understood. Yes, I have a multi-GPU setup.
Awesome! You can take a look at the example run_ner.py script (or other example scripts), they all use the HfArgumentParser to automatically parse the command line arguments into model_args, data_args and training_args.
You can also take a look at my example notebooks regarding fine-tuning LayoutLMv2 on the FUNSD dataset. Ideally, we also leverage HuggingFace Datasets, to automatically load the dataset from the hub. I've already uploaded that one a while ago: https://huggingface.co/datasets/nielsr/funsd
Let me know if you need any help!
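As a rough outline of what such a run_layoutlmv2.py could look like, here is a sketch only: the argument dataclasses are trimmed down, preprocess_batch is a hypothetical helper standing in for the LayoutLMv2Processor preprocessing step, and the ner_tags column name is assumed from the hub dataset:

from dataclasses import dataclass, field
from datasets import load_dataset
from transformers import (
    HfArgumentParser,
    LayoutLMv2ForTokenClassification,
    LayoutLMv2Processor,
    Trainer,
    TrainingArguments,
)

@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="microsoft/layoutlmv2-base-uncased")
    processor_name: str = field(default="microsoft/layoutlmv2-base-uncased")

@dataclass
class DataTrainingArguments:
    dataset_name: str = field(default="nielsr/funsd")

def main():
    # HfArgumentParser turns the command line into typed dataclasses (model_args, data_args, training_args)
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    dataset = load_dataset(data_args.dataset_name)
    labels = dataset["train"].features["ner_tags"].feature.names

    processor = LayoutLMv2Processor.from_pretrained(model_args.processor_name)
    model = LayoutLMv2ForTokenClassification.from_pretrained(
        model_args.model_name_or_path, num_labels=len(labels)
    )

    # preprocess_batch (not shown) would run `processor` over the images, words,
    # boxes and word labels of each batch to build the model inputs
    encoded = dataset.map(preprocess_batch, batched=True, remove_columns=dataset["train"].column_names)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["test"] if "test" in encoded else None,
    )
    if training_args.do_train:
        trainer.train()

if __name__ == "__main__":
    main()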
Thank you for taking up this much-needed suggestion. I've been running the FUNSD trainer with the following parameters:
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --standalone --nnodes=1 --nproc_per_node=3 run_layoutlmv2.py --model_name_or_path microsoft/layoutlmv2-base-uncased --processor_name microsoft/layoutlmv2-base-uncased --output_dir /tmp/test-layoutlmv2 --dataset_name nielsr/funsd --do_train --do_predict --max_steps 1000 --warmup_ratio 0.1 --fp16 --model_revision no_ocr --per_device_train_batch_size 2
I seem to run into a segfault error about 25% into the process. Here's the trace using CUDA_LAUNCH_BLOCKING=1.
File "/layoutlmv2/run_layoutlmv2.py", line 483, in <module>
main()
File "/layoutlmv2/run_layoutlmv2.py", line 414, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1859, in training_step
self.scaler.scale(loss).backward()
File "/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue [NOTE: This did not trigger the error].
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 2048, 7, 7], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(2048, 2048, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f1ddc0c7520
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 2048, 7, 7,
strideA = 100352, 49, 7, 1,
output: TensorDescriptor 0x7f1ddc032260
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 2048, 7, 7,
strideA = 100352, 49, 7, 1,
weight: FilterDescriptor 0x7f1ddc0c48a0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 2048, 2048, 1, 1,
Pointer addresses:
input: 0x7f1d7bd88000
output: 0x7f1d47190000
weight: 0x7f1d6c000000
Additional pointer addresses:
grad_output: 0x7f1d47190000
grad_weight: 0x7f1d6c000000
Backward filter algorithm: 3
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1f6a86fd62 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c4d3 (0x7f1f6aad24d3 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f1f6aad2ee2 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f1f6a859314 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2e9 (0x7f1fb44ded49 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::Reducer::~Reducer() + 0x24d (0x7f1fb44d118d in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f1fc7dfbe82 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f1fc722c696 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xe6c26f (0x7f1fc7dfe26f in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2a31e9 (0x7f1fc72351e9 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x2a44ee (0x7f1fc72364ee in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x12255b (0x55a41ae8955b in /miniconda3/bin/python)
frame #12: <unknown function> + 0x1a9333 (0x55a41af10333 in /miniconda3/bin/python)
frame #13: <unknown function> + 0x12255b (0x55a41ae8955b in /miniconda3/bin/python)
frame #14: <unknown function> + 0x1a9333 (0x55a41af10333 in /miniconda3/bin/python)
frame #15: <unknown function> + 0x12283c (0x55a41ae8983c in /miniconda3/bin/python)
frame #16: <unknown function> + 0x134eb7 (0x55a41ae9beb7 in /miniconda3/bin/python)
frame #17: <unknown function> + 0x134e1c (0x55a41ae9be1c in /miniconda3/bin/python)
frame #18: <unknown function> + 0x162e08 (0x55a41aec9e08 in /miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x55a41aee20c4 in /miniconda3/bin/python)
frame #20: <unknown function> + 0x26747b (0x55a41afce47b in /miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x191 (0x55a41afcea51 in /miniconda3/bin/python)
frame #22: Py_RunMain + 0x10c (0x55a41afd314c in /miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x55a41afd35b9 in /miniconda3/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f1fd99d0bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1f4a64 (0x55a41af5ba64 in /miniconda3/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33003 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33005 closing signal SIGTERM
If I run without launch blocking, I get:
Traceback (most recent call last):
File "/layoutlmv2/run_layoutlmv2.py", line 483, in <module>
main()
File "/layoutlmv2/run_layoutlmv2.py", line 414, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1847, in training_step
loss = self.compute_loss(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1881, in compute_loss
outputs = model(**inputs)
File "/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda3/lib/python3.9/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1197, in forward
active_logits = logits.view(-1, self.num_labels)[active_loss]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
==============================================================
Output of python -m torch.utils.collect_env:
PyTorch version: 1.10.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.18.0
Libc version: glibc-2.27
Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-159-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
GPU 4: GeForce GTX 1080 Ti
Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] torch==1.10.0
[pip3] torchvision==0.11.1
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 1.10.0 pypi_0 pypi
[conda] torchvision 0.11.1 pypi_0 pypi
Having the same issue with accelerate, +1.
With DistributedDataParallel and model.layoutlmv2.visual.synchronize_batch_norm(), I'm now seeing:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
This error indicates that your module has parameters that were not used in producing loss. You can
enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to
`torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs
participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate
the output tensors in the return value of your module's `forward` function. Please include the loss
function and the structure of the return value of `forward` of your module when reporting this issue
(e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: layoutlmv2.pooler.dense.bias,
layoutlmv2.pooler.dense.weight, layoutlmv2.visual.backbone.fpn_output4.bias,
layoutlmv2.visual.backbone.fpn_output4.weight, layoutlmv2.visual.backbone.fpn_output3.bias,
layoutlmv2.visual.backbone.fpn_output3.weight, layoutlmv2.visual.backbone.fpn_output5.weight,
layoutlmv2.visual.backbone.fpn_output5.bias
Did anybody else come across this? I tried setting a dataset-divisible total batch size and dataloader_drop_last=True
in case it was some kind of batch norm issue - but no luck...
Setup details:
- transformers v4.17, running on SageMaker Distributed Data Parallel
- Trainer-based training, calling training_args._setup_devices then model.layoutlmv2.visual.synchronize_batch_norm() before setting up the Trainer
- Fine-tuning for token classification (tried both AutoModelForTokenClassification and the specific LayoutLMv2ForTokenClassification)
- LayoutLMv2Processor is pre-applied in a dataset.map() before training
- Works fine in single-GPU / non-distributed setting
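One thing worth checking for this kind of "parameters that were not used in producing loss" error when going through the Trainer is the ddp_find_unused_parameters training argument, which is forwarded to DistributedDataParallel's find_unused_parameters flag (i.e., what the error message itself suggests). A sketch of setting it explicitly, assuming model and train_dataset are already prepared; this is not a confirmed fix for the SageMaker setup above:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/test-layoutlmv2",
    per_device_train_batch_size=2,
    max_steps=1000,
    # Forwarded to torch.nn.parallel.DistributedDataParallel(find_unused_parameters=...)
    ddp_find_unused_parameters=True,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()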
@athewsey were you able to resolve the issue?
I managed to run it on multiple GPUs, not with accelerate but by launching with torchrun --standalone --nnodes=1 --nproc_per_node=NUM_OF_GPUS (i.e., one process per GPU on a single node) and the Trainer, but first encountered a RuntimeError.
I believe this is because of a typo at https://github.com/huggingface/transformers/blob/main/src/transformers/models/layoutlmv2/modeling_layoutlmv2.py#L607, which should be if not (world_size % node_size == 0) rather than if not (world_size & node_size == 0): with world_size = node_size = 4, 4 & 4 evaluates to 4 rather than 0, so the check raises the RuntimeError even though 4 is divisible by 4.
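A quick way to see the difference, using just the arithmetic from the check above:

world_size, node_size = 4, 4

# Buggy check: bitwise AND. (4 & 4) == 4, which is not 0, so `not (... == 0)` is True
# and the RuntimeError fires even though world_size is divisible by node_size.
print(not (world_size & node_size == 0))   # True  -> raises in the original code

# Intended check: modulo. 4 % 4 == 0, so the condition is False and nothing is raised.
print(not (world_size % node_size == 0))   # False -> passes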
After this one character fix, now working fine.
@NielsRogge Would this be a tiny but good PR for a fix?
@akkikiki feel free to open a PR!
Environment info
- transformers version: 4.11.2
Who can help
- Models: LayoutLMv2 @NielsRogge
Information
- Model I am using: LayoutLMv2