microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

StopIteration when training LayoutLM on the sample data #252

Open wpm opened 4 years ago

wpm commented 4 years ago

Describe the bug I am trying to train the LayoutLM sequence labeling model as described in the LayoutLM README. Training fails with a StopIteration exception.

To Reproduce I set up my environment like so.

conda create -n layoutlm
conda activate layoutlm
conda install -c creditx gcc-7
conda install pytorch cudatoolkit=10.1 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

I preprocessed the example FUNSD data as described in the README then ran the following command.

python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py  \
       --data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
       --model_type layoutlm \
       --model_name_or_path /home/wmcneill/experiment/layoutlm/layoutlm-large-uncased \
       --do_lower_case \
       --max_seq_length 512 \
       --do_train \
       --num_train_epochs 100.0 \
       --logging_steps 10 \
       --save_steps -1 \
       --output_dir /home/wmcneill/experiment/layoutlm/FUNSD.layoutlm.model \
       --labels /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data/labels.txt \
       --per_gpu_train_batch_size 16 \
       --per_gpu_eval_batch_size 16 \
       --fp16

I see the following error very soon after starting training.

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Iteration:   0%|                                                                                                                                 | 0/5 [00:02<?, ?it/s]
Epoch:   0%|                                                                                                                                     | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 811, in <module>
    main()
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 703, in main
    global_step, tr_loss = train(
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 219, in train
    outputs = model(**inputs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 211, in forward
    outputs = self.bert(
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 143, in forward
    dtype=next(self.parameters()).dtype
StopIteration

Expected behavior I expect to train a model and have it created in the FUNSD.layoutlm.model directory. I am able to do this using the same setup on a different machine without a GPU.

sreejith3534 commented 3 years ago

Can you run the same script with CUDA_LAUNCH_BLOCKING=1 set at the start? I think that will give more information.
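
If it helps, the variable can either be prefixed on the command line (CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py ...) or, as a rough sketch, set from Python at the very top of the script, before torch initializes CUDA:

# Sketch only: make CUDA kernel launches synchronous so device-side errors
# surface at the Python call site. The variable must be set before the first
# CUDA call, so it goes above the torch import.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is in place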

gordonrust commented 3 years ago

Anyone getting this error: see if you can run on a single GPU. In my case, I can run on a single GPU, but I get this same error when I use multiple GPUs, which wrap the model in DataParallel.
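
For context, a minimal sketch of the failure mode this comment describes (names are illustrative; it assumes PyTorch >= 1.5 and at least two visible GPUs): a module that calls next(self.parameters()) inside forward works on one device, but fails inside DataParallel replicas, which no longer expose parameters through .parameters().

import torch
import torch.nn as nn

class DtypeInForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Fine on a single device; raises StopIteration inside a DataParallel
        # replica, because replicas on PyTorch >= 1.5 have no parameters to yield.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = nn.DataParallel(DtypeInForward().cuda())
out = model(torch.randn(8, 4).cuda())  # StopIteration with >= 2 GPUs, works with 1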

tengerye commented 3 years ago

I encounter the same issue here when I have more than one GPU.

hazoth commented 3 years ago

Downgrade PyTorch to 1.4.0, or modify the source code so that forward does not call self.parameters(): you can save next(self.parameters()).dtype in __init__ and use the saved dtype in forward. A sketch of the second option is below.
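
A minimal sketch of that second option, written against a generic module rather than the real layoutlm.py (all names here are illustrative): cache the parameter dtype once in __init__, where self.parameters() is still populated, and read the cached value in forward so that DataParallel replicas never iterate over parameters.

import torch
import torch.nn as nn

class MaskedEncoder(nn.Module):
    """Illustrative module; not the actual LayoutLM code."""

    def __init__(self, hidden_size=4):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        # Cache the parameter dtype once, while self.parameters() is still populated.
        self.param_dtype = next(self.parameters()).dtype

    def forward(self, x, attention_mask):
        # Use the cached dtype instead of next(self.parameters()).dtype, which
        # raises StopIteration inside DataParallel replicas on PyTorch >= 1.5.
        attention_mask = attention_mask.to(dtype=self.param_dtype)
        return self.linear(x) * attention_mask.unsqueeze(-1)

One trade-off: the cached dtype goes stale if the weights are converted later (for example with model.half()), in which case reading the dtype from an actual weight tensor, such as an embedding matrix, is a safer variant.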