microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

[LayoutReader] Training loss is low but inference performs terribly #826

Open · Mountchicken opened this issue 2 years ago

Mountchicken commented 2 years ago

Describe Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutReader

Hi @zlwang-cs, I am using LayoutReader to predict layout-only data like the attached page (image: 12_MTH_1), where the reading order is right-to-left, top-to-bottom.

However, I encountered some problems and hope you can share some insights.

zlwang-cs commented 2 years ago

Hi, thanks for your interest in our paper. I'd be glad to help you fix the issues.

For the first question, I am not sure what the problem is without more detailed information. I would recommend using breakpoints or other debugging tools to see what the tensor shapes look like at this step.
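
As a concrete version of that suggestion, here is a minimal sketch (the tensor names and shapes are illustrative, not taken from run_seq2seq.py) of checking shapes right before a failing step:

    import torch

    # Toy tensors standing in for a real batch; the point is the technique:
    # print the shapes (or call breakpoint()) just before the step that fails.
    input_ids = torch.zeros(2, 513, dtype=torch.long)      # (batch, seq_len)
    attention_mask = torch.ones(2, 513, dtype=torch.long)  # same shape

    print("input_ids     :", tuple(input_ids.shape))
    print("attention_mask:", tuple(attention_mask.shape))
    # breakpoint()  # drops into pdb for interactive inspection (Python 3.7+)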

For the second question, I guess the problem comes from the reading order of your dataset. The data you are using is quite different from the original setting in our paper: the pre-trained weights assume a left-to-right, top-to-bottom reading order, so that pre-training setting may be an obstacle in your experiment. Considering this, I don't find the poor performance surprising. If you would like to keep this reading-order setting, you may need to collect enough data to pre-train the model again, or resort to other approaches. Another possible way is to rotate the image 90 degrees counterclockwise so that it resembles the common reading-order setting (see the sketch below).
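
For reference, here is a minimal sketch of that rotation, assuming Pillow and pixel-coordinate boxes in (x0, y0, x1, y1) form; the function is illustrative, not from the repo. On a page of width W, a 90-degree counterclockwise turn maps a point (x, y) to (y, W - x), so right-to-left columns become the usual left-to-right, top-to-bottom rows:

    from PIL import Image

    def rotate_page_ccw(image, boxes):
        # PIL's ROTATE_90 rotates counterclockwise; a W x H page becomes H x W.
        w, _ = image.size
        rotated = image.transpose(Image.Transpose.ROTATE_90)
        # (x, y) -> (y, W - x), so (x0, y0, x1, y1) -> (y0, W - x1, y1, W - x0).
        rotated_boxes = [(y0, w - x1, y1, w - x0) for (x0, y0, x1, y1) in boxes]
        return rotated, rotated_boxes

If the boxes are already normalized to LayoutLM's 0-1000 grid, the same remapping applies with W = 1000.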

Mountchicken commented 2 years ago

Hi @zlwang-cs, thanks for the prompt reply. I'll try to debug this and see if I can locate the problem. Rotating the image 90 degrees counterclockwise seems quite reasonable, and I'll try that too.

One thing still puzzles me: you mentioned that LayoutReader loads pre-trained weights during training. Are these pre-trained weights word-level or textline-level? It seems I need textline-level weights here.

zlwang-cs commented 2 years ago

Hi @Mountchicken, the pre-trained model is word-level. Unfortunately, I cannot help you with textline-level pre-training.

Mountchicken commented 2 years ago

Hi @zlwang-cs, thanks for the reply. BTW, how do I load the pre-trained weights and fine-tune them on my own dataset? I downloaded layoutreader-base-readingbank.zip from this link and, after unpacking it, got config.json and pytorch_model.bin.

Which argument below should I point at these weights?

python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
    --model_type layoutlm \
    --model_name_or_path layoutlm-base-uncased \
    --train_folder /path/to/ReadingBank/train \
    --output_dir /path/to/output/LayoutReader/layoutlm \
    --do_lower_case \
    --fp16 \
    --fp16_opt_level O2 \
    --max_source_seq_length 513 \
    --max_target_seq_length 511 \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 7e-5 \
    --num_warmup_steps 500 \
    --num_training_steps 75000 \
    --cache_dir /path/to/output/LayoutReader/cache \
    --label_smoothing 0.1 \
    --save_steps 5000 \
    --cached_train_features_file /path/to/ReadingBank/features_train.pt
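
As a hedged aside (this describes the usual Hugging Face convention, not a confirmed behavior of run_seq2seq.py): --model_name_or_path is ordinarily the argument that accepts either a hub model name or a local directory containing config.json and pytorch_model.bin, so an unzipped checkpoint would normally be passed there. A minimal sketch of that convention, with a hypothetical path:

    from transformers import AutoConfig

    # Assumption: a local directory holding config.json (and pytorch_model.bin)
    # can be passed wherever a model name is expected; the path is hypothetical.
    config = AutoConfig.from_pretrained("/path/to/layoutreader-base-readingbank")
    print(config.model_type)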

zlwang-cs commented 2 years ago

Hi @Mountchicken, I assume weight loading is quite common in practice; please refer to the related documentation, and I am sure you can find the right answer. Also, I see you are using run_seq2seq.py, which is for training, but the weights you downloaded are actually for decoding. I guess that is why you are confused.
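
One quick way to check what a downloaded checkpoint actually contains (and whether its parameter names match the model being fine-tuned) is to load the state dict directly; the path below is illustrative:

    import torch

    # Load the unzipped checkpoint on CPU and list a few parameter names;
    # comparing them with the training model's expected keys shows whether
    # this checkpoint can be loaded for fine-tuning.
    state_dict = torch.load("pytorch_model.bin", map_location="cpu")
    for name, tensor in list(state_dict.items())[:10]:
        print(name, tuple(tensor.shape))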