superxii opened this issue 1 year ago
same question
same question!
You can fix this by modifying `tokenize_and_align_labels` in `run_funsd_cord.py` (tested on transformers 4.30.2). Pass the words, boxes, and word labels directly to the tokenizer:
```python
def tokenize_and_align_labels(examples, augmentation=False):
    images = examples["image"]
    words = examples["tokens"]
    boxes = examples["bboxes"]
    word_labels = examples["ner_tags"]

    tokenized_inputs = tokenizer(
        text=words,
        boxes=boxes,
        word_labels=word_labels,
        padding=False,
        truncation=True,
        return_overflowing_tokens=True,
        # We use this argument because the texts in our dataset are lists
        # of words (with a label for each word).
        # is_split_into_words=True,
    )
```
.... Continued ....
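For context on why the fix above works: the LayoutLMv3 tokenizer takes pretokenized words together with one bounding box (and optionally one label) per word, then aligns the word-level labels to subword tokens itself. The alignment it performs can be sketched in pure Python as follows (a hypothetical helper for illustration, not the actual transformers code):

```python
def align_labels_to_tokens(word_ids, word_labels, ignore_index=-100):
    """Map one label per word to one label per subword token.

    word_ids: for each token, the index of the word it came from,
              or None for special tokens ([CLS], [SEP], padding).
    Only the first subword of each word keeps the real label; special
    tokens and continuation subwords get ignore_index so the loss
    function skips them.
    """
    labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(ignore_index)          # special token
        elif word_id != previous_word:
            labels.append(word_labels[word_id])  # first subword of a word
        else:
            labels.append(ignore_index)          # continuation subword
        previous_word = word_id
    return labels
```

For example, if a two-word input tokenizes so that the first word splits into two subwords, `align_labels_to_tokens([None, 0, 0, 1, None], [3, 7])` returns `[-100, 3, -100, 7, -100]`.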
Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...):

The problem arises when using:
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:
```shell
python ./examples/run_funsd_cord.py \
    --dataset_name funsd \
    --do_train --do_eval \
    --model_name_or_path microsoft/layoutlmv3-base \
    --output_dir ./models/layoutlmv3-base-finetuned-funsd-500 \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --max_steps 500 --save_steps -20 --evaluation_strategy steps --eval_steps 20 \
    --learning_rate 1e-5 --gradient_accumulation_steps 1 \
    --load_best_model_at_end \
    --metric_for_best_model "eval_f1"
```
Expected behavior
A training session on FUNSD can be started.

But instead I got `ValueError: You must provide corresponding bounding boxes`.
Full stack trace:

```
[INFO|modeling_utils.py:2275] 2023-02-14 11:04:11,403 >> loading weights file pytorch_model.bin from cache at /home/datax/.cache/huggingface/hub/models--microsoft--layoutlmv3-base/snapshots/07c9b0838ccc7b49f4c284ccc96113d1dc527ff4/pytorch_model.bin
[INFO|modeling_utils.py:2857] 2023-02-14 11:04:12,415 >> All model checkpoint weights were used when initializing LayoutLMv3ForTokenClassification.
[WARNING|modeling_utils.py:2859] 2023-02-14 11:04:12,416 >> Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/1 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "./examples/run_funsd_cord.py", line 525, in <module>
    main()
  File "./examples/run_funsd_cord.py", line 375, in main
    train_dataset = train_dataset.map(
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2815, in map
    return self._map_single(
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 546, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 513, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3236, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3112, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "./examples/run_funsd_cord.py", line 315, in tokenize_and_align_labels
    tokenized_inputs = tokenizer(
  File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py", line 310, in __call__
    raise ValueError("You must provide corresponding bounding boxes")
ValueError: You must provide corresponding bounding boxes
```
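The guard that raises this error can be illustrated with a simplified sketch (an assumption about the tokenizer's behavior, not the actual transformers source): because LayoutLMv3 is layout-aware, pretokenized input (a batch of word lists) must be accompanied by one bounding box per word, and the unmodified `tokenize_and_align_labels` call did not pass `boxes`.

```python
def check_layout_inputs(text, boxes=None):
    # Simplified sketch of the guard in the LayoutLMv3 fast tokenizer:
    # pretokenized input must come with one bounding box per word.
    words_are_pretokenized = isinstance(text, (list, tuple))
    if words_are_pretokenized and boxes is None:
        raise ValueError("You must provide corresponding bounding boxes")

# Passing words without boxes reproduces the crash:
try:
    check_layout_inputs([["Invoice", "Number", "12345"]])
except ValueError as e:
    print(e)  # You must provide corresponding bounding boxes

# Supplying one box per word satisfies the check:
check_layout_inputs(
    [["Invoice", "Number", "12345"]],
    boxes=[[[0, 0, 50, 10], [55, 0, 110, 10], [115, 0, 160, 10]]],
)
```

This is why the fix at the top of the thread works: it forwards `boxes` (and `word_labels`) to the tokenizer instead of the raw text alone.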