microsoft / LayoutGeneration


Some "type" Task Questions in LayoutDiffusion #43

Open molu-ggg opened 3 months ago

molu-ggg commented 3 months ago

Hello, I want to implement a function that, given several labels, generates the corresponding coordinates to form a reasonable layout. Is this the task you referred to as the "type" method? However, I have a few questions:

  1. Is the training phase generic regardless of the task? It seems that no command is provided for training the "type" task specifically.

  2. In the "type" task, aren't the input labels fixed? If the input is a fixed set of label types, the output should at least contain the same labels (see Data 1 and Data 2 at the end; both have 2 tables and 4 texts). I found that in the first few training steps used for testing (5000 steps), the model's output was very chaotic, including the labels (see Data 3). Only toward the end of training did the labels gradually stabilize, and even then they still did not completely match the test set. The input and output labels differ, which confuses me. What could be the reason for this, and what should I do?

I look forward to your reply. Thank you very much for your help~

-- Data1 (a test sample): table 10 14 115 59 | table 10 67 115 74 | text 10 78 61 119 | text 65 78 115 119 | text 31 11 94 13 | text 10 63 115 67

-- Data2 (pretrained generate): ["START table 10 23 115 74 | table 10 77 115 111 | text 10 9 61 18 | text 65 9 115 18 | text 10 21 50 22 | text 10 75 81 76 END PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD"]

-- Data3 (my generate): ["START 29 PAD PAD 121 PAD | 118 35 29 PAD PAD | 104 PAD PAD PAD PAD | table PAD PAD 47 PAD 75 100 PAD PAD PAD PAD | table 25 PAD PAD PAD 15 PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD 17 PAD PAD PAD PAD PAD PAD PAD PAD PAD 99 PAD PAD PAD 107 PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD 89 PAD 57 PAD PAD"]
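For reference, this is how I compare the label sets of two samples: a minimal parsing sketch (the helper names are my own; it only assumes the `label x0 y0 x1 y1 | ...` serialization shown above, skipping the START/END/PAD special tokens):

```python
from collections import Counter

def parse_layout(s):
    """Parse 'label x0 y0 x1 y1 | ...' into (label, bbox) tuples,
    dropping START/END/PAD special tokens and malformed chunks."""
    for tok in ("START", "END", "PAD"):
        s = s.replace(tok, "")
    elements = []
    for chunk in s.split("|"):
        tokens = chunk.split()
        # A well-formed element is one label followed by four integer coordinates.
        if len(tokens) == 5 and not tokens[0].isdigit() and all(t.isdigit() for t in tokens[1:]):
            elements.append((tokens[0], tuple(int(t) for t in tokens[1:])))
    return elements

data1 = ("table 10 14 115 59 | table 10 67 115 74 | text 10 78 61 119 | "
         "text 65 78 115 119 | text 31 11 94 13 | text 10 63 115 67")
print(Counter(label for label, _ in parse_layout(data1)))
# Counter({'text': 4, 'table': 2})
```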

My training command:

python scripts/train.py --checkpoint_path ../results/checkpoint/pub_cond --model_arch transformer --modality e2e-tgt --save_interval 1000 --lr 3e-5 --batch_size 32 --diffusion_steps 200 --noise_schedule gaussian_refine_pow2.5 --use_kl False --learn_sigma False --aux_loss True --rescale_timesteps False --seq_length 121 --num_channels 128 --seed 102 --dropout 0.1 --padding_mode pad --experiment random --lr_anneal_steps 400000 --weight_decay 0.0 --predict_xstart True --training_mode discrete1 --vocab_size 139 --submit False --e2e_train ../data/processed_datasets/PublayNet_ltrb_lex

My inference command:

python scripts/batch_decode.py ../results/checkpoint/pub_cond -1.0 ema 20 20 False -1 type

Junyi42 commented 3 months ago

Hi,

> Is the training phase generic regardless of the task?

Yes, our method enables conditional generation in a plug-and-play manner; please refer to Sec. 4.3 of our paper for more details.

> In the "type" task, aren't the input labels fixed?

Yes, we only feed the input labels when starting the sampling process (implementation here). Therefore, it is possible for the output layout to violate the input condition if the model is not trained well. One simple fix is to fix the label (and format) tokens at each sampling step; we did not see this problem in our experiments, so we did not do that.
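A rough sketch of that fix could look like the following (the `diffusion.p_sample` call and the variable names here are placeholders for illustration, not the actual API of this repo):

```python
import torch

@torch.no_grad()
def sample_with_fixed_labels(model, diffusion, x_t, cond_tokens, cond_mask):
    """Reverse diffusion loop that re-imposes the known condition tokens.

    x_t:         (B, L) noisy token ids at the last timestep.
    cond_tokens: (B, L) the known label/format token ids (arbitrary elsewhere).
    cond_mask:   (B, L) bool, True at the conditioned positions.
    """
    for t in reversed(range(diffusion.num_timesteps)):
        # Clamp the label/format tokens before every denoising step,
        # so the condition cannot drift away during sampling.
        x_t = torch.where(cond_mask, cond_tokens, x_t)
        x_t = diffusion.p_sample(model, x_t, t)  # placeholder: one reverse step
    # Clamp once more so the final layout is guaranteed to keep the labels.
    return torch.where(cond_mask, cond_tokens, x_t)
```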

Please feel free to let me know if there are any further problems.

Thanks.