microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

StopIteration when training LayoutLM on the sample data #252

Open wpm opened 4 years ago

wpm commented 4 years ago

Describe the bug I am trying to train the LayoutLM sequence labeling model as described in the LayoutLM README. Training fails with a StopIteration exception.

To Reproduce I set up my environment like so.

conda create -n layoutlm
conda activate layoutlm
conda install -c creditx gcc-7
conda install pytorch cudatoolkit=10.1 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

I preprocessed the example FUNSD data as described in the README then ran the following command.

python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py  \
       --data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
       --model_type layoutlm \
       --model_name_or_path /home/wmcneill/experiment/layoutlm/layoutlm-large-uncased \
       --do_lower_case \
       --max_seq_length 512 \
       --do_train \
       --num_train_epochs 100.0 \
       --logging_steps 10 \
       --save_steps -1 \
       --output_dir /home/wmcneill/experiment/layoutlm/FUNSD.layoutlm.model \
       --labels /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data/labels.txt \
       --per_gpu_train_batch_size 16 \
       --per_gpu_eval_batch_size 16 \
       --fp16

I see the following error very soon after starting training.

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Iteration:   0%|                                                                                                                                 | 0/5 [00:02<?, ?it/s]
Epoch:   0%|                                                                                                                                     | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 811, in <module>
    main()
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 703, in main
    global_step, tr_loss = train(
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 219, in train
    outputs = model(**inputs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 211, in forward
    outputs = self.bert(
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 143, in forward
    dtype=next(self.parameters()).dtype
StopIteration

Expected behavior I expect to train a model and have it created in the FUNSD.layoutlm.model directory. I am able to do this using the same setup on a different machine without a GPU.

sreejith3534 commented 3 years ago

Can you run the same script with CUDA_LAUNCH_BLOCKING=1 set at the start? I think that will give more information.
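
If it helps, the variable can either be prefixed on the command line (CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py ...) or, as a rough sketch, set from Python at the very top of the script, before torch initializes CUDA:

# Sketch only: make CUDA kernel launches synchronous so device-side errors
# surface at the Python call site. The variable must be set before the first
# CUDA call, so it goes above the torch import.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is in place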

gordonrust commented 3 years ago

Anyone getting this error: see if you can run on a single GPU. In my case, I can run on a single GPU, but I get this same error when I use multiple GPUs, which wrap the model in DataParallel.
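
For context, a minimal sketch of the failure mode this comment describes (names are illustrative; it assumes PyTorch >= 1.5 and at least two visible GPUs): a module that calls next(self.parameters()) inside forward works on one device, but fails inside DataParallel replicas, which no longer expose parameters through .parameters().

import torch
import torch.nn as nn

class DtypeInForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Fine on a single device; raises StopIteration inside a DataParallel
        # replica, because replicas on PyTorch >= 1.5 have no parameters to yield.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = nn.DataParallel(DtypeInForward().cuda())
out = model(torch.randn(8, 4).cuda())  # StopIteration with >= 2 GPUs, works with 1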

tengerye commented 3 years ago

I encounter the same issue here when I have more than one GPU.

hazoth commented 3 years ago

Downgrade PyTorch to 1.4.0, or modify the source code so that forward does not call self.parameters(): you can save next(self.parameters()).dtype in __init__ and use the saved dtype in forward. A sketch of the second option is below.
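
A minimal sketch of that second option, written against a generic module rather than the real layoutlm.py (all names here are illustrative): cache the parameter dtype once in __init__, where self.parameters() is still populated, and read the cached value in forward so that DataParallel replicas never iterate over parameters.

import torch
import torch.nn as nn

class MaskedEncoder(nn.Module):
    """Illustrative module; not the actual LayoutLM code."""

    def __init__(self, hidden_size=4):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        # Cache the parameter dtype once, while self.parameters() is still populated.
        self.param_dtype = next(self.parameters()).dtype

    def forward(self, x, attention_mask):
        # Use the cached dtype instead of next(self.parameters()).dtype, which
        # raises StopIteration inside DataParallel replicas on PyTorch >= 1.5.
        attention_mask = attention_mask.to(dtype=self.param_dtype)
        return self.linear(x) * attention_mask.unsqueeze(-1)

One trade-off: the cached dtype goes stale if the weights are converted later (for example with model.half()), in which case reading the dtype from an actual weight tensor, such as an embedding matrix, is a safer variant.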