Hi,
I've answered this question here.
TL;DR: you need to first call model.layoutlmv2.visual.synchronize_batch_norm().
Hi @NielsRogge, thanks for your quick response. I looked at that repo as well just a couple of minutes ago. The problem I face with that solution is that it gives this error:
raise RuntimeError("Make sure torch.distributed is set up properly.")
RuntimeError: Make sure torch.distributed is set up properly.
I read the above-linked post. The OP there also faces the same problem and you recommend the following:
You probably first need to call torch.distributed.init_process_group() before starting training.
Using this in the code forces me to implement DistributedDataParallel instead of the conventional DataParallel. Can you suggest something to help further?
It requires setting up the backend, rank, and world_size for DistributedDataParallel. Is this the way to go? Can you give an example of a working script that handles batch-norm synchronization without forcing the switch away from DataParallel?
Currently, I have added the following lines of code in my script:
import os
import torch

os.environ['MASTER_ADDR'] = 'localhost'  # rendezvous address/port for torch.distributed
os.environ['MASTER_PORT'] = '12355'
torch.distributed.init_process_group("nccl", rank=0, world_size=2)  # blocks until world_size processes have joined
model = LayoutLMv2_Classification_model().to(device)
model.LayoutLMv2Encoder.visual.synchronize_batch_norm()
The terminal hangs and there is no output displayed.
Any help on this case will be highly appreciated!! Thanks once again!
Are you running all of this in a notebook or as a script? The authors defined everything in a Python script, which they then launch as follows:
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_funsd.py \
--model_name_or_path microsoft/layoutlmv2-base-uncased \
--output_dir /tmp/test-ner \
--do_train \
--do_predict \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16
That's the recommended way to train deep learning models with PyTorch on multiple GPUs. torch.distributed.launch
is a helper utility that can be used to launch multiple processes per node for distributed training.
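For context, the script that gets launched this way is expected to do the per-process distributed setup itself; torch.distributed.init_process_group() blocks until world_size processes have joined, which is why calling it from a single, manually started process (as in the snippet earlier in this thread) appears to hang. Below is a minimal sketch of that per-process setup, assuming torchrun-style environment variables and a hypothetical build_model() helper; it is not the authors' run_funsd.py:

import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher starts one process per GPU and sets LOCAL_RANK / RANK / WORLD_SIZE for each.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl")

    model = build_model().cuda(local_rank)             # hypothetical: builds the LayoutLMv2-based model
    model.layoutlmv2.visual.synchronize_batch_norm()   # sync the visual backbone's batch norm across ranks
    model = DDP(model, device_ids=[local_rank])
    # ... regular training loop over the rank-local shard of the data ...

if __name__ == "__main__":
    main()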
It would be great if we could add an example script for LayoutLMv2/LayoutXLM to the examples folder of HuggingFace Transformers. It would mean updating the Python script so it works with HuggingFace Transformers instead of the original unilm repository.
Are you interested in contributing this?
Actually, let me mark it as a "good first issue" (this is a good first contribution for people interested in contributing). This way, we can help others fine-tune LayoutLMv2 on multiple GPUs.
Shall I take this up?
@harsha070 would be great! So the goal would be to add an example script that could be called run_layoutlmv2.py
that uses the HuggingFace Trainer to fine-tune the model on the FUNSD dataset. You can also create a run_layoutlmv2_no_trainer.py
script that leverages HuggingFace Accelerate instead to run on multiple GPUs.
Do you have a setup with more than 1 GPU?
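For reference, a no-trainer variant on multiple GPUs typically follows the Accelerate pattern sketched below. This is a generic outline under stated assumptions (FUNSD's 7 token labels, a train_dataloader built elsewhere that yields batches already encoded by LayoutLMv2Processor), not the eventual run_layoutlmv2_no_trainer.py:

from accelerate import Accelerator
from torch.optim import AdamW
from transformers import LayoutLMv2ForTokenClassification

accelerator = Accelerator()  # picks up the distributed config from `accelerate launch` / torchrun

model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7
)
optimizer = AdamW(model.parameters(), lr=5e-5)

# train_dataloader is assumed to be built elsewhere from the processed FUNSD dataset
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward() so gradients sync across processes
    optimizer.step()
    optimizer.zero_grad()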
Sure. Understood. Yes, I have a multi-GPU setup.
Awesome! You can take a look at the example run_ner.py script (or other example scripts), they all use the HfArgumentParser to automatically parse the command line arguments into model_args, data_args and training_args.
You can also take a look at my example notebooks regarding fine-tuning LayoutLMv2 on the FUNSD dataset. Ideally, we also leverage HuggingFace Datasets, to automatically load the dataset from the hub. I've already uploaded that one a while ago: https://huggingface.co/datasets/nielsr/funsd
Let me know if you need any help!
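As a rough outline of what such a run_layoutlmv2.py could look like, here is a sketch only: the argument dataclasses are trimmed down, preprocess_batch is a hypothetical helper standing in for the LayoutLMv2Processor preprocessing step, and the ner_tags column name is assumed from the hub dataset:

from dataclasses import dataclass, field
from datasets import load_dataset
from transformers import (
    HfArgumentParser,
    LayoutLMv2ForTokenClassification,
    LayoutLMv2Processor,
    Trainer,
    TrainingArguments,
)

@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="microsoft/layoutlmv2-base-uncased")
    processor_name: str = field(default="microsoft/layoutlmv2-base-uncased")

@dataclass
class DataTrainingArguments:
    dataset_name: str = field(default="nielsr/funsd")

def main():
    # HfArgumentParser turns the command line into typed dataclasses (model_args, data_args, training_args)
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    dataset = load_dataset(data_args.dataset_name)
    labels = dataset["train"].features["ner_tags"].feature.names

    processor = LayoutLMv2Processor.from_pretrained(model_args.processor_name)
    model = LayoutLMv2ForTokenClassification.from_pretrained(
        model_args.model_name_or_path, num_labels=len(labels)
    )

    # preprocess_batch (not shown) would run `processor` over the images, words,
    # boxes and word labels of each batch to build the model inputs
    encoded = dataset.map(preprocess_batch, batched=True, remove_columns=dataset["train"].column_names)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["test"] if "test" in encoded else None,
    )
    if training_args.do_train:
        trainer.train()

if __name__ == "__main__":
    main()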
Thank you for taking up this much-needed suggestion. I've been running the FUNSD trainer with the following parameters:
CUDA_VISIBLE_DEVICES=0,1,2 torchrun --standalone --nnodes=1 --nproc_per_node=3 run_layoutlmv2.py --model_name_or_path microsoft/layoutlmv2-base-uncased --processor_name microsoft/layoutlmv2-base-uncased --output_dir /tmp/test-layoutlmv2 --dataset_name nielsr/funsd --do_train --do_predict --max_steps 1000 --warmup_ratio 0.1 --fp16 --model_revision no_ocr --per_device_train_batch_size 2
I seem to run into a segfault error about 25% into the process. Here's the trace using CUDA_LAUNCH_BLOCKING=1.
File "/layoutlmv2/run_layoutlmv2.py", line 483, in <module>
main()
File "/layoutlmv2/run_layoutlmv2.py", line 414, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1859, in training_step
self.scaler.scale(loss).backward()
File "/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue [NOTE: This did not trigger the error].
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 2048, 7, 7], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(2048, 2048, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f1ddc0c7520
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 2048, 7, 7,
strideA = 100352, 49, 7, 1,
output: TensorDescriptor 0x7f1ddc032260
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 2048, 7, 7,
strideA = 100352, 49, 7, 1,
weight: FilterDescriptor 0x7f1ddc0c48a0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 2048, 2048, 1, 1,
Pointer addresses:
input: 0x7f1d7bd88000
output: 0x7f1d47190000
weight: 0x7f1d6c000000
Additional pointer addresses:
grad_output: 0x7f1d47190000
grad_weight: 0x7f1d6c000000
Backward filter algorithm: 3
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1f6a86fd62 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c4d3 (0x7f1f6aad24d3 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f1f6aad2ee2 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f1f6a859314 in /miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2e9 (0x7f1fb44ded49 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::Reducer::~Reducer() + 0x24d (0x7f1fb44d118d in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f1fc7dfbe82 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f1fc722c696 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xe6c26f (0x7f1fc7dfe26f in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2a31e9 (0x7f1fc72351e9 in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x2a44ee (0x7f1fc72364ee in /miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x12255b (0x55a41ae8955b in /miniconda3/bin/python)
frame #12: <unknown function> + 0x1a9333 (0x55a41af10333 in /miniconda3/bin/python)
frame #13: <unknown function> + 0x12255b (0x55a41ae8955b in /miniconda3/bin/python)
frame #14: <unknown function> + 0x1a9333 (0x55a41af10333 in /miniconda3/bin/python)
frame #15: <unknown function> + 0x12283c (0x55a41ae8983c in /miniconda3/bin/python)
frame #16: <unknown function> + 0x134eb7 (0x55a41ae9beb7 in /miniconda3/bin/python)
frame #17: <unknown function> + 0x134e1c (0x55a41ae9be1c in /miniconda3/bin/python)
frame #18: <unknown function> + 0x162e08 (0x55a41aec9e08 in /miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x55a41aee20c4 in /miniconda3/bin/python)
frame #20: <unknown function> + 0x26747b (0x55a41afce47b in /miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x191 (0x55a41afcea51 in /miniconda3/bin/python)
frame #22: Py_RunMain + 0x10c (0x55a41afd314c in /miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x55a41afd35b9 in /miniconda3/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f1fd99d0bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1f4a64 (0x55a41af5ba64 in /miniconda3/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33003 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33005 closing signal SIGTERM
If I run without launch blocking, I get:
Traceback (most recent call last):
File "/layoutlmv2/run_layoutlmv2.py", line 483, in <module>
main()
File "/layoutlmv2/run_layoutlmv2.py", line 414, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1847, in training_step
loss = self.compute_loss(model, inputs)
File "/miniconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1881, in compute_loss
outputs = model(**inputs)
File "/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda3/lib/python3.9/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1197, in forward
active_logits = logits.view(-1, self.num_labels)[active_loss]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
==============================================================
Output of python -m torch.utils.collect_env:
PyTorch version: 1.10.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.18.0
Libc version: glibc-2.27
Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-159-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
GPU 4: GeForce GTX 1080 Ti
Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.3
[pip3] torch==1.10.0
[pip3] torchvision==0.11.1
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 1.10.0 pypi_0 pypi
[conda] torchvision 0.11.1 pypi_0 pypi
Having the same issue with accelerate, +1.
With DistributedDataParallel and model.layoutlmv2.visual.synchronize_batch_norm(), I'm now seeing:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
This error indicates that your module has parameters that were not used in producing loss. You can
enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to
`torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs
participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate
the output tensors in the return value of your module's `forward` function. Please include the loss
function and the structure of the return value of `forward` of your module when reporting this issue
(e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: layoutlmv2.pooler.dense.bias,
layoutlmv2.pooler.dense.weight, layoutlmv2.visual.backbone.fpn_output4.bias,
layoutlmv2.visual.backbone.fpn_output4.weight, layoutlmv2.visual.backbone.fpn_output3.bias,
layoutlmv2.visual.backbone.fpn_output3.weight, layoutlmv2.visual.backbone.fpn_output5.weight,
layoutlmv2.visual.backbone.fpn_output5.bias
Did anybody else come across this? I tried setting a dataset-divisible total batch size and dataloader_drop_last=True
in case it was some kind of batch norm issue - but no luck...
Setup details:
- transformers v4.17, running on SageMaker Distributed Data Parallel
- Trainer-based training, calling training_args._setup_devices then model.layoutlmv2.visual.synchronize_batch_norm() before setting up the Trainer
- Fine-tuning for token classification (tried both AutoModelForTokenClassification and the specific LayoutLMv2ForTokenClassification)
- LayoutLMv2Processor is pre-applied in a dataset.map() before training
- Works fine in single-GPU / non-distributed setting
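One thing worth checking for this kind of "parameters that were not used in producing loss" error when going through the Trainer is the ddp_find_unused_parameters training argument, which is forwarded to DistributedDataParallel's find_unused_parameters flag (i.e., what the error message itself suggests). A sketch of setting it explicitly, assuming model and train_dataset are already prepared; this is not a confirmed fix for the SageMaker setup above:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/test-layoutlmv2",
    per_device_train_batch_size=2,
    max_steps=1000,
    # Forwarded to torch.nn.parallel.DistributedDataParallel(find_unused_parameters=...)
    ddp_find_unused_parameters=True,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()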
@athewsey were you able to resolve the issue?
I managed to run it on multiple GPUs, not with accelerate but by launching with torchrun --standalone --nnodes=1 --nproc_per_node=NUM_OF_GPUS (i.e., one process per GPU on a single node) and the Trainer, but first encountered a RuntimeError.
I believe this is because of a typo at https://github.com/huggingface/transformers/blob/main/src/transformers/models/layoutlmv2/modeling_layoutlmv2.py#L607, which should be if not (world_size % node_size == 0) rather than if not (world_size & node_size == 0): with world_size = node_size = 4, 4 & 4 evaluates to 4 rather than 0, so the check raises the RuntimeError even though 4 is divisible by 4.
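A quick way to see the difference, using just the arithmetic from the check above:

world_size, node_size = 4, 4

# Buggy check: bitwise AND. (4 & 4) == 4, which is not 0, so `not (... == 0)` is True
# and the RuntimeError fires even though world_size is divisible by node_size.
print(not (world_size & node_size == 0))   # True  -> raises in the original code

# Intended check: modulo. 4 % 4 == 0, so the condition is False and nothing is raised.
print(not (world_size % node_size == 0))   # False -> passes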
After this one character fix, now working fine.
@NielsRogge Would this be a tiny but good PR for a fix?
@akkikiki feel free to open a PR!
Environment info
- transformers version: 4.11.2
Who can help
- Models: LayoutLMv2 @NielsRogge
Information
- Model I am using: LayoutLMv2