google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Fine-tune tapas_wikisql_sqa_masklm_large on WTQ #53

Closed lairikeqiA closed 4 years ago

lairikeqiA commented 4 years ago

I have been fine-tuning tapas_wikisql_sqa_masklm_large on the WTQ dataset on GPU for a few days to obtain the tapas_wtq_wikisql_sqa_masklm_large model, but the dev accuracy is only 30%. What could be causing this?

ghost commented 4 years ago

Can you share some details, such as the command you are running? Since you train on GPU, I suspect you are using a smaller batch size. How many updates has the model performed (global_step * batch_size)?

In the experiments I ran, the model reached at least 41% accuracy after 5,000 steps (at batch size 512, ~= 2.5M updates) across all random runs.
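For reference, here is the back-of-the-envelope arithmetic behind those numbers (a rough sketch; the step count and batch size are just the reference values quoted above):

```python
# updates = global_step * batch_size, as described above.
reference_steps = 5_000      # steps at which the reference runs reached >= 41% accuracy
reference_batch_size = 512   # batch size used in the reference experiments

reference_updates = reference_steps * reference_batch_size
print(f"Reference run: {reference_updates:,} updates")  # 2,560,000, i.e. the ~2.5M quoted above
```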

lairikeqiA commented 4 years ago

This is my command:

python3 run_task_main.py \
  --task=WTQ \
  --output_dir=/mnt/cjc/tapas-master/WTQ \
  --model_dir=/mnt/cjc/tapas-master/tapas_wikisql_sqa_masklm_large \
  --init_checkpoint=/mnt/cjc/tapas-master/tapas_wikisql_sqa_masklm_large/model.ckpt \
  --bert_config_file=/mnt/cjc/tapas-master/tapas_wikisql_sqa_masklm_large/bert_config.json \
  --mode=train \
  --use_tpu=False \
  --iterations_per_loop=5 \
  --train_batch_size=4 \
  --max_seq_length=512

The model reached 30% accuracy after 21,000 steps (at batch size 4, ~= 84,000 updates).
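Plugging the numbers from this run into the same formula (a quick sketch using only the values quoted in this thread):

```python
# My GPU run
gpu_steps = 21_000
gpu_batch_size = 4
gpu_updates = gpu_steps * gpu_batch_size
print(f"GPU run: {gpu_updates:,} updates")  # 84,000

# Equivalent number of steps at the reference batch size of 512
print(f"Equivalent steps at batch size 512: {gpu_updates / 512:.0f}")  # ~164, far below 5,000
```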

I have another question: does the tapas_wtq_wikisql_sqa_masklm_large model shuffle the WTQ test dataset?

ghost commented 4 years ago

It should not shuffle the test data, only the training data.

Batch size 4 is pretty small; I would assume that you will not reach the full accuracy with that batch size.

With 84,000 updates you are at about 0.34% of what we usually have, so you might have to wait a bit more.

How is the accuracy developing as a function of the global steps?

lairikeqiA commented 4 years ago

The model converges faster in early training than in late training.

eisenjulian commented 4 years ago

@lairikeqiA Indeed, that is expected: the learning rate decays according to tf.train.polynomial_decay, combined with the use of the AdamOptimizer.
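A minimal sketch of how that kind of schedule behaves (the initial learning rate, end learning rate, and total step count below are hypothetical placeholders, not the values from the TAPAS configs):

```python
def polynomial_decay(initial_lr, global_step, decay_steps, end_lr=0.0, power=1.0):
    """Same formula as tf.train.polynomial_decay with cycle=False."""
    step = min(global_step, decay_steps)
    return (initial_lr - end_lr) * (1.0 - step / decay_steps) ** power + end_lr

# Hypothetical values, chosen only to illustrate the shape of the decay.
initial_lr = 1e-5
total_steps = 50_000

for step in (0, 5_000, 25_000, 45_000, 50_000):
    print(f"step {step:>6}: lr = {polynomial_decay(initial_lr, step, total_steps):.2e}")
```

With power=1.0 the learning rate ramps down linearly, so the size of each update shrinks steadily over training, which is consistent with the faster convergence you see early on.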