google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Reproducing WikiSQL and SQA numbers #75

Closed aminfardi closed 4 years ago

aminfardi commented 4 years ago

I'm trying to produce the reported numbers in the README in Colab for WikiSQL (supervised and weakly supervised) and SQA.

For WikiSQL I am using tapas_wikisql_sqa_masklm_small_reset. For the weakly supervised I tried:

!python tapas/tapas/run_task_main.py \
  --task="WIKISQL" \
  --output_dir="gs://gs-colab-bucket/data" \
  --bert_config_file="data/wikisql/model/bert_config.json" \
  --bert_vocab_file="data/wikisql/model/vocab.txt" \
  --tpu_name='grpc://10.11.116.106:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

But results were nowhere close:

I1006 22:51:43.239191 140375104935808 calc_metrics_utils.py:414] denotation_accuracy=0.004988
dev denotation accuracy: 0.0050
I1006 22:51:43.456792 140375104935808 run_task_main.py:152] dev denotation accuracy: 0.0050
I1006 22:52:38.437489 140375104935808 calc_metrics_utils.py:414] denotation_accuracy=0.006361
test denotation accuracy: 0.0064
I1006 22:52:38.888067 140375104935808 run_task_main.py:152] test denotation accuracy: 0.0064

Trying WikiSQL supervised:


!python tapas/tapas/run_task_main.py \
  --task="WIKISQL_SUPERVISED" \
  --output_dir="gs://gs-colab-bucket/data_supervised" \
  --bert_config_file="data_supervised/wikisql_supervised/model/bert_config.json" \
  --bert_vocab_file="data_supervised/wikisql_supervised/model/vocab.txt" \
  --tpu_name='grpc://10.11.116.106:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

Resulted in much better numbers, but still not matching the README:

I1010 04:55:43.283555 140631071238016 calc_metrics_utils.py:414] denotation_accuracy=0.749673
dev denotation accuracy: 0.7497
I1010 04:55:43.428153 140631071238016 run_task_main.py:152] dev denotation accuracy: 0.7497
I1010 04:56:35.534448 140631071238016 calc_metrics_utils.py:414] denotation_accuracy=0.736554
test denotation accuracy: 0.7366
I1010 04:56:35.813148 140631071238016 run_task_main.py:152] test denotation accuracy: 0.7366

Next I tried SQA using tapas_sqa_masklm_large_reset:

!python tapas/tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="gs://gs-colab-bucket/data_sqa" \
  --bert_config_file="gs://gs-colab-bucket/data_sqa/sqa/model/bert_config.json" \
  --bert_vocab_file="gs://gs-colab-bucket/data_sqa/sqa/model/vocab.txt" \
  --loop_predict="false" \
  --tpu_name='grpc://10.116.216.194:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

But I again get results 5% lower than the README:

I1010 20:17:49.353913 140657230886784 calc_metrics_utils.py:414] denotation_accuracy=0.642384
dev denotation accuracy: 0.6424
I1010 20:17:49.409619 140657230886784 run_task_main.py:152] dev denotation accuracy: 0.6424
Warning: Can't evaluate for dev_seq because gs://gs-colab-bucket/data_sqa/sqa/model/random-split-1-dev_sequence.tsv doesn't exist.
W1010 20:17:49.546101 140657230886784 run_task_main.py:157] Can't evaluate for dev_seq because gs://gs-colab-bucket/data_sqa/sqa/model/random-split-1-dev_sequence.tsv doesn't exist.
I1010 20:18:00.948407 140657230886784 calc_metrics_utils.py:414] denotation_accuracy=0.652722
test denotation accuracy: 0.6527
I1010 20:18:01.029566 140657230886784 run_task_main.py:152] test denotation accuracy: 0.6527
Warning: Can't evaluate for test_seq because gs://gs-colab-bucket/data_sqa/sqa/model/test_sequence.tsv doesn't exist.

I see the warnings above regarding sequence evaluation, but that seems to be related to running on the TPU:

Warning: Skipping SQA sequence evaluation because eval is running on TPU.

Update: I also tried the WTQ dataset with the tapas_wtq_wikisql_sqa_masklm_large_reset model:

!python tapas/tapas/run_task_main.py \
  --task="WTQ" \
  --output_dir="gs://gs-colab-bucket/data_wtq" \
  --bert_config_file="gs://gs-colab-bucket/data_wtq/wtq/model/bert_config.json" \
  --bert_vocab_file="gs://gs-colab-bucket/data_wtq/wtq/model/vocab.txt" \
  --loop_predict="false" \
  --tpu_name='grpc://10.92.172.218:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate"

Getting results 10% lower than the README:

I1011 04:48:55.989768 140620040472448 calc_metrics_utils.py:414] denotation_accuracy=0.387544
dev denotation accuracy: 0.3875
I1011 04:48:56.042782 140620040472448 run_task_main.py:152] dev denotation accuracy: 0.3875
I1011 04:49:14.033151 140620040472448 calc_metrics_utils.py:414] denotation_accuracy=0.390654
test denotation accuracy: 0.3907
I1011 04:49:14.111141 140620040472448 run_task_main.py:152] test denotation accuracy: 0.3907

I'm wondering if anyone has been able to successfully replicate the reported numbers?

alxblandin commented 4 years ago

@aminfardi I retried everything from the beginning, and at the data-creation step I realised I got many conversion errors:

Processed: test.tfrecord
I1013 18:04:12.467182 140220029744960 run_task_main.py:152] Processed: test.tfrecord
Num questions processed: 15878
I1013 18:04:12.467303 140220029744960 run_task_main.py:152] Num questions processed: 15878
Num examples: 15243
I1013 18:04:12.467345 140220029744960 run_task_main.py:152] Num examples: 15243
Num conversion errors: 635
I1013 18:04:12.467382 140220029744960 run_task_main.py:152] Num conversion errors: 635
Padded with 21 examples.
I1013 18:04:12.492669 140220029744960 run_task_main.py:152] Padded with 21 examples.

But the report says everything is fine, so I continued.

After the train and evaluate tasks I got these results:

I1013 18:24:14.170778 140001886750528 calc_metrics_utils.py:414] denotation_accuracy=0.907731
dev denotation accuracy: 0.9077
I1013 18:24:14.308156 140001886750528 run_task_main.py:152] dev denotation accuracy: 0.9077
I1013 18:24:49.742995 140001886750528 calc_metrics_utils.py:414] denotation_accuracy=0.908742
test denotation accuracy: 0.9087
I1013 18:24:50.039621 140001886750528 run_task_main.py:152] test denotation accuracy: 0.9087

Then I relaunched with the predict_and_evaluate task and got (as before):

I1013 18:40:30.455762 139800481957696 calc_metrics_utils.py:414] denotation_accuracy=0.783992
dev denotation accuracy: 0.7840
I1013 18:40:30.588578 139800481957696 run_task_main.py:152] dev denotation accuracy: 0.7840
I1013 18:41:04.743687 139800481957696 calc_metrics_utils.py:414] denotation_accuracy=0.771256
test denotation accuracy: 0.7713
I1013 18:41:05.011397 139800481957696 run_task_main.py:152] test denotation accuracy: 0.7713
Evaluation finished after training step 0.
I1013 18:41:05.182357 139800481957696 run_task_main.py:152] Evaluation finished after training step 0.

So can you please retry and share your results without the predict-and-evaluate step? It seems to give better results (I don't know why)...

aminfardi commented 4 years ago

@alxblandin, I don't understand why train_evaluate would be needed here. We are essentially pulling a pre-trained model and just evaluating. Which dataset are those metrics for? I feel like we're missing something very simple...

alxblandin commented 4 years ago

@aminfardi I did it to verify that fine-tuning works, as in this issue: #6

pawelknow commented 4 years ago

You shouldn't need "train_evaluate" with a model that was already fine-tuned. One problem that I see is that "reset" models need a flag "--reset_position_index_per_cell":

!python tapas/tapas/run_task_main.py \
  --task="WTQ" \
  --output_dir="gs://gs-colab-bucket/data_wtq" \
  --bert_config_file="gs://gs-colab-bucket/data_wtq/wtq/model/bert_config.json" \
  --bert_vocab_file="gs://gs-colab-bucket/data_wtq/wtq/model/vocab.txt" \
  --loop_predict="false" \
  --tpu_name='grpc://10.92.172.218:8470' \
  --use_tpu="true" \
  --mode="predict_and_evaluate" \
  --reset_position_index_per_cell

This changes the positional embeddings used by the model to the version used by the "reset" models. This would explain why running train and evaluate improves the results as it would retrain the "reset" model to use "non-reset" embeddings.
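
Not the actual TAPAS implementation, but a minimal sketch of what resetting the position index per cell means, using made-up token-to-cell ids:

# Hypothetical example: each entry of cell_ids maps a token to the table cell it belongs to.
def absolute_positions(cell_ids):
  return list(range(len(cell_ids)))

def reset_positions_per_cell(cell_ids):
  positions, prev_cell, pos = [], None, 0
  for cell in cell_ids:
    pos = 0 if cell != prev_cell else pos + 1
    positions.append(pos)
    prev_cell = cell
  return positions

cell_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(absolute_positions(cell_ids))        # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(reset_positions_per_cell(cell_ids))  # [0, 1, 2, 0, 1, 0, 1, 2, 3]

A "reset" checkpoint was trained with the second kind of index, so evaluating it with absolute positions (the default) gives the degraded numbers above.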

I've tried to reproduce the results on SQA in this colab: https://colab.research.google.com/drive/18aaKn34gVriZqJjag02WfhVch2_aK9Ak?usp=sharing. I'm getting results that are much closer (0.7241 vs reported 0.7289, so still a slight difference somewhere...):

dev denotation accuracy: 0.7316
dev_seq denotation accuracy: 0.6865
test denotation accuracy: 0.7580
test_seq denotation accuracy: 0.7241

Could you try again with the flag?

alxblandin commented 4 years ago

@pawelknow Thanks! Yes, I missed this flag, but where do you see an accuracy of 0.7289 in the README? Which model are you trying to reproduce the numbers for? You say you tried SQA, but the command you provided is for the WTQ task.

ghost commented 4 years ago

This is the command Pawel used for testing this in a colab:

python3 tapas/tapas/run_task_main.py \
  --task="SQA" \
  --input_dir="SQA Release 1.0/" \
  --output_dir="results" \
  --bert_config_file="tapas_sqa_inter_masklm_large_reset/bert_config.json" \
  --bert_vocab_file="tapas_sqa_inter_masklm_large_reset/vocab.txt" \
  --mode="create_data"

It's using one of the new models.

ghost commented 4 years ago

For completeness, you need to download the model (and put it in the right place) and the SQA datasets.

These are the commands Pawel used for that:

Model

wget https://storage.googleapis.com/tapas_models/2020_10_07/tapas_sqa_inter_masklm_large_reset.zip
unzip tapas_sqa_inter_masklm_large_reset.zip

# Python: create the task's model directory and point it at the downloaded checkpoint.
import os
import shutil

os.makedirs('results/sqa/model', exist_ok=True)
with open('results/sqa/model/checkpoint', 'w') as f:
  f.write('model_checkpoint_path: "model.ckpt-0"')
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
  shutil.copyfile(f'tapas_sqa_inter_masklm_large_reset/model.ckpt{suffix}',
                  f'results/sqa/model/model.ckpt-0{suffix}')

Data

wget https://download.microsoft.com/download/1/D/C/1DC270D2-1B53-4A61-A2E3-88AB3E4E6E1F/SQA%20Release%201.0.zip
unzip "SQA Release 1.0.zip"

Everything else should be the same as in the normal TAPAS colab.
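
Putting it together, the evaluation step for this setup would look roughly like the following sketch (paths as created above; since tapas_sqa_inter_masklm_large_reset is a "reset" checkpoint, the flag discussed earlier in this thread is needed):

python3 tapas/tapas/run_task_main.py \
  --task="SQA" \
  --input_dir="SQA Release 1.0/" \
  --output_dir="results" \
  --bert_config_file="tapas_sqa_inter_masklm_large_reset/bert_config.json" \
  --bert_vocab_file="tapas_sqa_inter_masklm_large_reset/vocab.txt" \
  --mode="predict_and_evaluate" \
  --reset_position_index_per_cell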

alxblandin commented 4 years ago

@thomasmueller-google thanks for your quick answer. For research I wonder how your model can be adapted to another language, i.e. at which training stage (pre-training, intermediate pre-training, fine-tuning) the data needs to be in a natural language other than English. Does the vocab file change at each of these stages, or only in the first ones? If so, is it possible to do a few more steps of pre-training starting from one of your checkpoints? Maybe I should create a new issue for this specifically. To stay on the current topic, here are my results after following your instructions:

I1014 21:47:09.579313 140179987752768 calc_metrics_utils.py:414] denotation_accuracy=0.885881
dev denotation accuracy: 0.8859
I1014 21:47:09.709722 140179987752768 run_task_main.py:152] dev denotation accuracy: 0.8859
I1014 21:47:43.541317 140179987752768 calc_metrics_utils.py:414] denotation_accuracy=0.862262
test denotation accuracy: 0.8623
I1014 21:47:43.811320 140179987752768 run_task_main.py:152] test denotation accuracy: 0.8623
Evaluation finished after training step 0.
I1014 21:47:43.980646 140179987752768 run_task_main.py:152] Evaluation finished after training step 0.

They are EXACTLY the same as in the README (I tested the WikiSQL supervised base model without reset). Thanks a lot :) (Pinging @aminfardi as well, because we have been trying so hard on this problem these last days.)

ghost commented 4 years ago

Regarding adaptation to other languages, just a quick response; please open a separate issue for more details.

We are using the same vocabulary for all training stages. It's the standard BERT uncased vocabulary.

In general, I would think that some of the things the model learns during pre-training should translate to table understanding tasks in other languages. Therefore, I would think that the English models can be a decent starting point for fine-tuning tasks in other languages. That said, a model pre-trained on the target language would probably work better.

alxblandin commented 4 years ago

@thomasmueller-google you can give a longer response here: #76. I was thinking the same thing: the model should adapt during fine-tuning. I'll keep you posted on my conclusions about how resilient your model is to a change of language at fine-tuning :)

muelletm commented 4 years ago

Can we consider this issue fixed?

aminfardi commented 4 years ago

Thank you @pawelknow for flagging the reset_position_index_per_cell flag. I was going by the run_task_main.py help, which didn't list it as an option, so I completely missed it. Looking back at the example Colabs, it is in there, however; I'll be sure to check those closely going forward.

On SQA I now get a test accuracy of 0.7361. This is much closer, but I suppose there will always be small discrepancies?

EDIT: I just realized the README reports dev accuracy. I'm getting dev denotation accuracy: 0.7166, compared to 0.7130 reported. I'm going to do the same comparison on the other datasets to be sure.

@thomasmueller-google, I will close this. Appreciate everyone's input in resolving this issue.

aminfardi commented 4 years ago

@alxblandin, curious which model you were trying to match for WikiSQL? My results for tapas_wikisql_sqa_masklm_small_reset are as follows:

For WikiSQL supervised, the README reports 0.8550 dev accuracy. I'm getting:

I1016 17:39:23.044502 140464927541120 calc_metrics_utils.py:414] denotation_accuracy=0.854649
dev denotation accuracy: 0.8546
I1016 17:39:23.184627 140464927541120 run_task_main.py:152] dev denotation accuracy: 0.8546
I1016 17:40:15.586334 140464927541120 calc_metrics_utils.py:414] denotation_accuracy=0.830142
test denotation accuracy: 0.8301
I1016 17:40:15.853811 140464927541120 run_task_main.py:152] test denotation accuracy: 0.8301

I'm still unable to match the weakly supervised numbers.

aminfardi commented 4 years ago

WTQ results on tapas_wtq_wikisql_sqa_masklm_large_reset:

I1016 18:03:25.518479 140645591684992 calc_metrics_utils.py:414] denotation_accuracy=0.502847
dev denotation accuracy: 0.5028
I1016 18:03:25.565771 140645591684992 run_task_main.py:152] dev denotation accuracy: 0.5028
I1016 18:03:39.927309 140645591684992 calc_metrics_utils.py:414] denotation_accuracy=0.505064
test denotation accuracy: 0.5051
I1016 18:03:39.997798 140645591684992 run_task_main.py:152] test denotation accuracy: 0.5051

Compared to the README dev accuracy of 0.4952.

I'd say these results are close enough.

alxblandin commented 4 years ago

@aminfardi I tested on tapas_wikisql_sqa_inter_masklm_base.zip