@aminfardi I re-tried everything from the beginning, and at the data-creation step for training I realised I got many errors:
Processed: test.tfrecord
I1013 18:04:12.467182 140220029744960 run_task_main.py:152] Processed: test.tfrecord
Num questions processed: 15878
I1013 18:04:12.467303 140220029744960 run_task_main.py:152] Num questions processed: 15878
Num examples: 15243
I1013 18:04:12.467345 140220029744960 run_task_main.py:152] Num examples: 15243
Num conversion errors: 635
I1013 18:04:12.467382 140220029744960 run_task_main.py:152] Num conversion errors: 635
Padded with 21 examples.
I1013 18:04:12.492669 140220029744960 run_task_main.py:152] Padded with 21 examples.
But the report said everything was fine, so I continued.
After the train and evaluate tasks I got these results:
I1013 18:24:14.170778 140001886750528 calc_metrics_utils.py:414] denotation_accuracy=0.907731
dev denotation accuracy: 0.9077
I1013 18:24:14.308156 140001886750528 run_task_main.py:152] dev denotation accuracy: 0.9077
I1013 18:24:49.742995 140001886750528 calc_metrics_utils.py:414] denotation_accuracy=0.908742
test denotation accuracy: 0.9087
I1013 18:24:50.039621 140001886750528 run_task_main.py:152] test denotation accuracy: 0.9087
Then I re-launched the predict_and_evaluate task and got (as before):
I1013 18:40:30.455762 139800481957696 calc_metrics_utils.py:414] denotation_accuracy=0.783992
dev denotation accuracy: 0.7840
I1013 18:40:30.588578 139800481957696 run_task_main.py:152] dev denotation accuracy: 0.7840
I1013 18:41:04.743687 139800481957696 calc_metrics_utils.py:414] denotation_accuracy=0.771256
test denotation accuracy: 0.7713
I1013 18:41:05.011397 139800481957696 run_task_main.py:152] test denotation accuracy: 0.7713
Evaluation finished after training step 0.
I1013 18:41:05.182357 139800481957696 run_task_main.py:152] Evaluation finished after training step 0.
So could you please retry and share your results without the predict_and_evaluate step? It seems to give better results (I don't know why)...
@alxblandin, I don't understand why train_evaluate would be needed here. We are essentially pulling a pre-trained model and just evaluating. What dataset are your metrics for? I feel like we're missing something very simple...
@aminfardi I did it to verify that fine-tuning works, as in this issue: #6
You shouldn't need "train_evaluate" with a model that was already fine-tuned. One problem that I see is that "reset" models need a flag "--reset_position_index_per_cell":
!python tapas/tapas/run_task_main.py \
--task="WTQ" \
--output_dir="gs://gs-colab-bucket/data_wtq" \
--bert_config_file="gs://gs-colab-bucket/data_wtq/wtq/model/bert_config.json" \
--bert_vocab_file="gs://gs-colab-bucket/data_wtq/wtq/model/vocab.txt" \
--loop_predict="false" \
--tpu_name='grpc://10.92.172.218:8470' \
--use_tpu="true" \
--mode="predict_and_evaluate" \
--reset_position_index_per_cell
This changes the positional embeddings used by the model to the version used by the "reset" models. This would explain why running train and evaluate improves the results as it would retrain the "reset" model to use "non-reset" embeddings.
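To make the difference concrete, here is a minimal, purely illustrative sketch of the two position-index schemes (the tokens, cell layout, and logic are made up for illustration; this is not the actual TAPAS implementation):

# Illustrative sketch only (not the actual TAPAS code): how "non-reset" and
# "reset" position indices differ for the same flattened table input.
# cell_ids marks which table cell each token belongs to (0 = question tokens).
tokens   = ["what", "is", "the", "rank", "Rank", "Name", "1", "Al", "##ice"]
cell_ids = [0,      0,    0,     0,      1,      2,      3,   4,    4]

# Non-reset: a single running index over the whole sequence.
non_reset = list(range(len(tokens)))

# Reset: the index restarts at 0 whenever a new cell starts.
reset, prev, pos = [], None, 0
for cid in cell_ids:
    pos = pos + 1 if cid == prev else 0
    reset.append(pos)
    prev = cid

print(non_reset)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(reset)      # [0, 1, 2, 3, 0, 0, 0, 0, 1]

The --reset_position_index_per_cell flag tells the model to use the second scheme, which is what the "reset" checkpoints were trained with.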
I've tried to reproduce the results on SQA in this colab: https://colab.research.google.com/drive/18aaKn34gVriZqJjag02WfhVch2_aK9Ak?usp=sharing. I'm getting results that are much closer (0.7241 vs the reported 0.7289, so there is still a slight difference somewhere...):
dev denotation accuracy: 0.7316
dev_seq denotation accuracy: 0.6865
test denotation accuracy: 0.7580
test_seq denotation accuracy: 0.7241
Could you try again with the flag?
@pawelknow Thanks! Yes, I missed this flag, but where do you see an accuracy of 0.7289 in the README? Which model are you trying to reproduce the numbers for? You say you tried SQA, but the command you provided is for the WTQ task.
This is the command Pawel used for testing this in a colab:
python3 tapas/tapas/run_task_main.py \
--task="SQA" \
--input_dir="SQA Release 1.0/" \
--output_dir="results" \
--bert_config_file="tapas_sqa_inter_masklm_large_reset/bert_config.json" \
--bert_vocab_file="tapas_sqa_inter_masklm_large_reset/vocab.txt" \
--mode="create_data"
It's using one of the new models.
For completeness, you need to download the model (and put it in the right place) and the SQA datasets.
These are the commands Pawel used for that:
wget https://storage.googleapis.com/tapas_models/2020_10_07/tapas_sqa_inter_masklm_large_reset.zip
unzip tapas_sqa_inter_masklm_large_reset.zip
import os
import shutil

# Point run_task_main.py at the downloaded weights, renamed to model.ckpt-0.
os.makedirs('results/sqa/model', exist_ok=True)
with open('results/sqa/model/checkpoint', 'w') as f:
    f.write('model_checkpoint_path: "model.ckpt-0"')
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
    shutil.copyfile(f'tapas_sqa_inter_masklm_large_reset/model.ckpt{suffix}',
                    f'results/sqa/model/model.ckpt-0{suffix}')
wget https://download.microsoft.com/download/1/D/C/1DC270D2-1B53-4A61-A2E3-88AB3E4E6E1F/SQA%20Release%201.0.zip
unzip "SQA Release 1.0.zip"
Everything else should be the same as in the normal TAPAS colab.
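For reference, the subsequent evaluation call would then presumably mirror the WTQ command earlier in this thread, adapted to the SQA paths set up above (these flags are the ones already discussed here; the exact invocation in the colab may differ slightly):

python3 tapas/tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="results" \
  --bert_config_file="tapas_sqa_inter_masklm_large_reset/bert_config.json" \
  --bert_vocab_file="tapas_sqa_inter_masklm_large_reset/vocab.txt" \
  --loop_predict="false" \
  --mode="predict_and_evaluate" \
  --reset_position_index_per_cell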
@thomasmueller-google thanks for your quick answer. For research, I wonder how your model can be adapted to another language. I mean, at which training stage (pre-training, intermediate, fine-tuning) does the data need to be in a natural language other than English? Does the vocab file still change at each of these stages, or only at the first ones? If so, is it possible to do a few more steps of pre-training starting from one of your checkpoints? Maybe I should create a new issue specifically for this. To stay on the current topic, here are my results after following your instructions:
I1014 21:47:09.579313 140179987752768 calc_metrics_utils.py:414] denotation_accuracy=0.885881
dev denotation accuracy: 0.8859
I1014 21:47:09.709722 140179987752768 run_task_main.py:152] dev denotation accuracy: 0.8859
I1014 21:47:43.541317 140179987752768 calc_metrics_utils.py:414] denotation_accuracy=0.862262
test denotation accuracy: 0.8623
I1014 21:47:43.811320 140179987752768 run_task_main.py:152] test denotation accuracy: 0.8623
Evaluation finished after training step 0.
I1014 21:47:43.980646 140179987752768 run_task_main.py:152] Evaluation finished after training step 0.
They are EXACTLY the same as in the README (I tested the WikiSQL-supervised base model without reset). Thanks a lot :) (I'm pinging @aminfardi as well because we were trying so hard on this problem over the last few days.)
Regarding adaptation to other languages, just a quick response here; please open a separate issue for more details.
We are using the same vocabulary for all training stages. It's the standard BERT uncased vocabulary.
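If you want to double-check this, one quick way is to hash the vocab files shipped with two different checkpoints; the directory names below are the ones used elsewhere in this thread, so adjust them to whatever you have downloaded:

import hashlib

def file_md5(path):
    # Hash in chunks so large files don't have to be loaded at once.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

# If all checkpoints really ship the same standard BERT uncased vocabulary,
# these digests should be identical.
print(file_md5('tapas_sqa_inter_masklm_large_reset/vocab.txt'))
print(file_md5('tapas_wtq_wikisql_sqa_masklm_large_reset/vocab.txt'))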
In general, I would think that some of the things the model learns during pre-training should translate to table understanding tasks in other languages, so the English models can be a decent starting point for fine-tuning in other languages. That said, a model pre-trained on the target language would probably work better.
@thomasmueller-google you can give a longer response here: #76. I was thinking the same thing: the model should adapt during fine-tuning. I'll keep you posted on my conclusions about how resilient your model is to a change of language at fine-tuning time :)
Can we consider this issue fixed?
Thank you @pawelknow for flagging the reset_position_index_per_cell flag. I was using the run_task_main.py help and it didn't list that option, so I completely missed it. Looking back at the example colabs, it is in there, however; I'll be sure to check those closely going forward.
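As a side note, the flag may simply not exist in an older checkout, or it may be defined in an imported module rather than in run_task_main.py itself. Assuming the script uses absl flags (the I1013-style log lines suggest it does), the short --help output only shows the main module's flags, while --helpfull lists every registered flag, e.g.:

python tapas/tapas/run_task_main.py --helpfull | grep reset_position_index_per_cell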
On SQA I now get a test accuracy of 0.7361. This is much closer, but I suppose there will always be small discrepancies?
EDIT: I just realized the README reports dev accuracy. I'm getting dev denotation accuracy: 0.7166, compared to 0.7130 reported. I'm going to do the same comparison on the other datasets to be sure.
@thomasmueller-google, I will close this. Appreciate everyone's input in resolving this issue.
@alxblandin, curious which model you were trying to match for WikiSQL? My results for tapas_wikisql_sqa_masklm_small_reset are as follows.
For WikiSQL_supervised, 0.8550 dev accuracy is reported in the README. I'm getting:
I1016 17:39:23.044502 140464927541120 calc_metrics_utils.py:414] denotation_accuracy=0.854649
dev denotation accuracy: 0.8546
I1016 17:39:23.184627 140464927541120 run_task_main.py:152] dev denotation accuracy: 0.8546
I1016 17:40:15.586334 140464927541120 calc_metrics_utils.py:414] denotation_accuracy=0.830142
test denotation accuracy: 0.8301
I1016 17:40:15.853811 140464927541120 run_task_main.py:152] test denotation accuracy: 0.8301
I'm still unable to match the weakly supervised.
WTQ results on tapas_wtq_wikisql_sqa_masklm_large_reset:
I1016 18:03:25.518479 140645591684992 calc_metrics_utils.py:414] denotation_accuracy=0.502847
dev denotation accuracy: 0.5028
I1016 18:03:25.565771 140645591684992 run_task_main.py:152] dev denotation accuracy: 0.5028
I1016 18:03:39.927309 140645591684992 calc_metrics_utils.py:414] denotation_accuracy=0.505064
test denotation accuracy: 0.5051
I1016 18:03:39.997798 140645591684992 run_task_main.py:152] test denotation accuracy: 0.5051
Compared to the 0.4952 dev accuracy reported in the README.
I'd say these results are close enough.
@aminfardi I tested on tapas_wikisql_sqa_inter_masklm_base.zip
I'm trying to reproduce the numbers reported in the README in Colab for WikiSQL (supervised and weakly supervised) and SQA.
For WikiSQL I am using tapas_wikisql_sqa_masklm_small_reset. For the weakly supervised I tried:
But results were nowhere close:
Trying WikiSQL supervised:
Resulted in much better numbers, but still not matching the README:
Next I tried SQA using tapas_sqa_masklm_large_reset:
But I again get results 5% lower than the README:
I see the warnings above regarding sequence evaluation, but that seems to be related to the TPU:
Warning: Skipping SQA sequence evaluation because eval is running on TPU.
Update: also tried the WTQ dataset and tapas_wtq_wikisql_sqa_masklm_large_reset model:
Getting results 10% lower than the README:
I'm wondering if anyone has been able to successfully replicate the reported numbers?