microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutLM Sequence Labelling Task: Assertion Error for Prediction/Inference on custom dataset (same error for evaluation) #326

Open AIMLAPP opened 3 years ago

AIMLAPP commented 3 years ago

Hi,

When I run inference/prediction for the LayoutLM Sequence Labelling Task on custom test data, I receive the following error. The same error occurs when I run evaluation on my custom test data. If I use the default training and testing sets, everything runs smoothly. Could anyone please advise me on the correct steps, as I am currently stuck? I have tried multiple methods, all of which result in the same error.

[image: screenshot of the error, not preserved]

Please note that I have made some minor edits to run_seq_labeling.py, based on the following issue: https://github.com/microsoft/unilm/issues/152

Method 1a)

Step 1) Train. Please note that the folder "data" contains the training data and the original testing data.

    python run_seq_labeling.py --data_dir data \
        --model_type layoutlm \
        --model_name_or_path path/to/pretrained/model/directory \
        --do_lower_case \
        --max_seq_length 512 \
        --do_train \
        --num_train_epochs 100.0 \
        --logging_steps 10 \
        --save_steps -1 \
        --output_dir path/to/output/directory \
        --labels data/labels.txt \
        --per_gpu_train_batch_size 16 \
        --per_gpu_eval_batch_size 16 \
        --fp16

Step 2) Run inference/prediction. Please note that the folder "data1" contains custom testing data.

    python run_seq_labeling.py --do_predict \
        --data_dir data1 \
        --model_type layoutlm \
        --model_name_or_path output \
        --do_lower_case \
        --output_dir predictions1 \
        --labels data1/labels.txt

Result: Assertion Error
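Note that Step 1 reads data/labels.txt while Step 2 reads data1/labels.txt. Whether this is related to the assertion is not established in this thread, but it is cheap to rule out. Below is a minimal sketch that reports any difference between two label files, assuming one label per line. The helper names read_labels and compare_label_files are made up for illustration and are not part of run_seq_labeling.py.

```python
from pathlib import Path


def read_labels(path):
    """Return the non-empty lines of a labels file, in file order."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_label_files(train_labels_path, test_labels_path):
    """Report labels found in only one file, and whether the order matches.

    Assumption: each file holds one label per line.
    """
    train = read_labels(train_labels_path)
    test = read_labels(test_labels_path)
    return {
        "only_in_train": sorted(set(train) - set(test)),
        "only_in_test": sorted(set(test) - set(train)),
        "same_order": train == test,
    }


# Example call with the paths from the commands above:
# print(compare_label_files("data/labels.txt", "data1/labels.txt"))
```

If both "only_in_*" lists are empty and "same_order" is true, the label files can be ruled out as a difference between training and prediction.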

Method 1b) Step 1) Train

Step 2) Evaluate

    python run_seq_labeling.py --data_dir infer_data \
        --model_type layoutlm \
        --model_name_or_path output_method1 \
        --do_lower_case \
        --do_eval \
        --output_dir output_eval1 \
        --labels infer_data/labels.txt

Result: Assertion Error

Step 3) Infer (can't do this step due to the error in Step 2)

Method 2a:

Step 1) Train, with the testing data in data_dir REPLACED by the infer (custom) data.

    python run_seq_labeling.py --data_dir data_method2 \
        --model_type layoutlm \
        --model_name_or_path model \
        --do_lower_case \
        --max_seq_length 512 \
        --do_train \
        --num_train_epochs 10.0 \
        --logging_steps 10 \
        --save_steps -1 \
        --output_dir output_method2 \
        --labels data_method2/labels.txt \
        --per_gpu_train_batch_size 16 \
        --per_gpu_eval_batch_size 16 \
        --fp16

Step 2) Run Inference/Prediction

    python run_seq_labeling.py --do_predict \
        --data_dir infer_data \
        --model_type layoutlm \
        --model_name_or_path output_method2 \
        --do_lower_case \
        --output_dir pred_m2_no_eval \
        --labels data1/labels.txt \
        --fp16

Result: Assertion Error

Method 2b: Step 1) Train. Same as Method 2a

Step 2) Run Evaluation

    python run_seq_labeling.py --data_dir data_method2 \
        --model_type layoutlm \
        --model_name_or_path output_method2 \
        --do_lower_case \
        --do_eval \
        --output_dir output_eval2 \
        --labels infer_data/labels.txt

Result: Assertion Error

Step 3) Inference. Unable to proceed to this step.
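Since every variant works with the default data and fails only with the custom data, it may be worth checking whether the custom files are internally consistent. The sketch below assumes a FUNSD-style layout of three tab-separated files per split (test.txt, test_box.txt, test_image.txt), with the same token on the same line in each and blank lines between documents. That layout is an assumption, as is the helper name check_alignment, and this is not claimed to be the cause of the error in the screenshot.

```python
from pathlib import Path


def check_alignment(data_dir, mode="test"):
    """Compare <mode>.txt, <mode>_box.txt and <mode>_image.txt line by line.

    Assumed layout (not confirmed in this thread): tab-separated columns with
    the token first, and blank lines separating documents in all three files.
    Returns a list of (line_number, message); an empty list means no mismatch.
    """
    base = Path(data_dir)
    names = [f"{mode}.txt", f"{mode}_box.txt", f"{mode}_image.txt"]
    files = [(base / n).read_text(encoding="utf-8").splitlines() for n in names]

    problems = []
    if len({len(f) for f in files}) != 1:
        counts = ", ".join(f"{n}={len(f)}" for n, f in zip(names, files))
        problems.append((0, "line counts differ: " + counts))

    # zip stops at the shortest file; the count check above flags the rest.
    for i, rows in enumerate(zip(*files), start=1):
        blank = [not r.strip() for r in rows]
        if any(blank):
            if not all(blank):
                problems.append((i, "blank separator line is not in all files"))
            continue
        tokens = [r.split("\t")[0] for r in rows]
        if len(set(tokens)) != 1:
            problems.append((i, f"token differs across files: {tokens}"))
        if len(rows[0].split("\t")) != 2:
            problems.append((i, "expected 'token<TAB>label' in the label file"))
    return problems


# Example call with a directory name from the commands above:
# for line_no, message in check_alignment("infer_data"):
#     print(line_no, message)
```

Running it on both the default data directory and the custom one gives a quick comparison: if the default data returns an empty list and the custom data does not, the reported line numbers are a starting point for inspection.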

knitemblazor commented 3 years ago

I would suggest you go through this repo: https://github.com/knitemblazor/Multilingual_LayoutLM