microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[LMV3 Bug] ValueError: You must provide corresponding bounding boxes when running the example run_funsd_cord.py #995

Open superxii opened 1 year ago

superxii commented 1 year ago

Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLMv3

The problem arises when using: the official example script ./examples/run_funsd_cord.py

To Reproduce
Steps to reproduce the behavior:

  1. Follow the guide on LayoutLMv3.
  2. Run the following script:

    python ./examples/run_funsd_cord.py \
        --dataset_name funsd \
        --do_train --do_eval \
        --model_name_or_path microsoft/layoutlmv3-base \
        --output_dir ./models/layoutlmv3-base-finetuned-funsd-500 \
        --segment_level_layout 1 --visual_embed 1 --input_size 224 \
        --max_steps 500 --save_steps -20 --evaluation_strategy steps --eval_steps 20 \
        --learning_rate 1e-5 --gradient_accumulation_steps 1 \
        --load_best_model_at_end \
        --metric_for_best_model "eval_f1"

Expected behavior
A training session on FUNSD can be started.

But I got ValueError: You must provide corresponding bounding boxes

Full stack trace:

    [INFO|modeling_utils.py:2275] 2023-02-14 11:04:11,403 >> loading weights file pytorch_model.bin from cache at /home/datax/.cache/huggingface/hub/models--microsoft--layoutlmv3-base/snapshots/07c9b0838ccc7b49f4c284ccc96113d1dc527ff4/pytorch_model.bin
    [INFO|modeling_utils.py:2857] 2023-02-14 11:04:12,415 >> All model checkpoint weights were used when initializing LayoutLMv3ForTokenClassification.

    [WARNING|modeling_utils.py:2859] 2023-02-14 11:04:12,416 >> Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      0%|          | 0/1 [00:00<?, ?ba/s]
    Traceback (most recent call last):
      File "./examples/run_funsd_cord.py", line 525, in <module>
        main()
      File "./examples/run_funsd_cord.py", line 375, in main
        train_dataset = train_dataset.map(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2815, in map
        return self._map_single(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 546, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 513, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
        out = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3236, in _map_single
        batch = apply_function_on_filtered_inputs(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3112, in apply_function_on_filtered_inputs
        processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
      File "./examples/run_funsd_cord.py", line 315, in tokenize_and_align_labels
        tokenized_inputs = tokenizer(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py", line 310, in __call__
        raise ValueError("You must provide corresponding bounding boxes")
    ValueError: You must provide corresponding bounding boxes
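
The failure can be reproduced with just the tokenizer, outside the training script. Below is a minimal sketch (assuming the microsoft/layoutlmv3-base checkpoint; the words and boxes are made up): the LayoutLMv3 fast tokenizer treats a list of strings as pre-split words and raises this error whenever no matching boxes list is supplied.

    from transformers import AutoTokenizer

    # Assumes the microsoft/layoutlmv3-base checkpoint is available locally or via the Hub.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base")

    words = ["Invoice", "Number:"]

    # The LayoutLMv3 tokenizer interprets a list of strings as pre-split words,
    # so calling it without boxes raises:
    # ValueError: You must provide corresponding bounding boxes
    # tokenizer(text=words)

    # One [x0, y0, x1, y1] box per word, in 0-1000 normalized coordinates, avoids the error.
    boxes = [[20, 10, 110, 30], [120, 10, 210, 30]]
    encoding = tokenizer(text=words, boxes=boxes)
    print(encoding["input_ids"], encoding["bbox"])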

Forrest-ht commented 1 year ago

same question

TK12331 commented 1 year ago

same question!

gregbugaj commented 1 year ago

You can fix this by modifying tokenize_and_align_labels in run_funsd_cord.py (tested on transformers 4.30.2).

We pass the words, boxes, and word_labels directly to the tokenizer.

    def tokenize_and_align_labels(examples, augmentation=False):
        images = examples["image"]
        words = examples["tokens"]
        boxes = examples["bboxes"]
        word_labels = examples["ner_tags"]

        tokenized_inputs = tokenizer(
            text=words,
            boxes=boxes,
            word_labels=word_labels,
            padding=False,
            truncation=True,
            return_overflowing_tokens=True,
            # is_split_into_words=True is not needed: the LayoutLMv3 tokenizer already
            # expects lists of words, with a box and a label for each word.
            # is_split_into_words=True,
        )

 ....  Continued  ....
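
As a sanity check on this change, here is a sketch under the same transformers 4.30.x assumption (it is not the rest of the commenter's function): because boxes and word_labels are passed in, the returned encoding should already contain bbox and labels aligned to the subword tokens, with special tokens labeled -100.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base")

    # One toy example in the same format the FUNSD dataset provides (values are made up).
    words = ["Date:", "14/02/2023"]
    boxes = [[30, 20, 80, 35], [90, 20, 180, 35]]  # 0-1000 normalized [x0, y0, x1, y1] per word
    word_labels = [1, 2]                           # one NER tag id per word

    enc = tokenizer(
        text=words,
        boxes=boxes,
        word_labels=word_labels,
        truncation=True,
    )

    # bbox and labels are expanded to subword length; labels for special tokens are -100.
    print(enc["input_ids"])
    print(enc["bbox"])
    print(enc["labels"])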
