salesforce / TabularSemanticParsing

Translating natural language questions to a structured query language
https://arxiv.org/abs/2012.12627
BSD 3-Clause "New" or "Revised" License

Trouble reproducing Bridge-Large 70.0% EM on Spider dev #10

Open tomerwolgithub opened 3 years ago

tomerwolgithub commented 3 years ago

First of all, thank you for sharing this terrific work. I found it really straightforward to plug in and start training.

However, when training Bridge-L (with BERT-large) on Spider, I'm unable to reach 70.0% EM on the dev set; my results keep peaking at around 66.7%. I'm training on a GeForce RTX 3090, and the default setting with batch size 16 was too much for its 24 GB of memory, so I tried a few runs with batch sizes 8, 4, and 2 and accumulation steps 4, 8, and 16, respectively (all other hyperparameters at their defaults). All of these runs capped below 67% after more than 100K steps.

I was wondering whether this is simply the result of the different hardware and my batch sizes being smaller than 16, or am I missing something?

Thanks!

todpole3 commented 3 years ago

Varying the batch size your way shouldn't change performance because the effective batch size is kept at 32. Would you mind sharing your training curve or the accuracy change log?
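
To make that concrete, here is a minimal gradient accumulation sketch (toy linear model and random data, not this repo's actual trainer): with batch size 8 and 4 accumulation steps, the optimizer still updates once per 32 examples, the same effective batch as batch size 16 with 2 accumulation steps.

```python
import torch
from torch import nn

# Toy stand-ins for illustration only; the repo trains a BERT-based
# encoder-decoder, not this linear model.
torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

batch_size = 8    # per-step batch that fits in 24 GB
accum_steps = 4   # 8 * 4 = 32, same effective batch as 16 * 2

batches = [(torch.randn(batch_size, 10), torch.randint(0, 2, (batch_size,)))
           for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    loss = loss_fn(model(x), y)
    # Scale so the accumulated gradient equals the mean over the full
    # effective batch of batch_size * accum_steps examples.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()       # one update per 32 examples
        optimizer.zero_grad()
```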

tomerwolgithub commented 3 years ago

Sure thing. Below are the EM performance curves for all four of the models I trained. The top-performing model (batch size 8, accumulation steps 4) scored 67.1%:

[W&B chart: dev set EM curves for the four training runs]

Please let me know if there's any other info that might help. Thanks!

todpole3 commented 3 years ago

The difference might be caused by the data repair step described in Section 4.3 of the paper.

I've modified the data processing steps to incorporate it. Please git pull and follow the instructions here: https://github.com/salesforce/TabularSemanticParsing#spider

mingtan888 commented 3 years ago

I'm also getting 66.7% EM on the Spider dev set. I pulled the latest repo last week and ran the data repair step. My config matches the default exactly, except that the training and dev batch sizes are 8 instead of 16 and 24.

Should I use a different configuration to reach 70.0%?