google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Problem in including the three labels #103

Open jaihonikhil opened 3 years ago

jaihonikhil commented 3 years ago

Hello, I was trying to extend the TAPAS model to predict three labels, but when I fine-tune it on data containing three labels it reports a checkpoint mismatch, which is understandable. Can you please suggest a way to include the three labels? Also, can you clarify how to produce the interactions mentioned in the pre-training data creation step, in particular --input_file="gs://tapas_models/2020_05_11/interactions.txtpb.gz"? [screenshot]

eisenjulian commented 3 years ago

Hello @jaihonikhil, regarding the first issue: as you may imagine, the problem is that changing the number of classes changes the dimension of the last projection, producing an incompatible checkpoint. In the next release we will update the code with a new flag to remove the incompatible tensor when loading a checkpoint. In the meantime you may manually remove/rename output_weights_cls in the checkpoint, or rename that same variable to something else in tapas_classifier_model.py.
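Conceptually, stripping the incompatible classifier head amounts to filtering output_weights_cls (and its bias, if present) out of the variable map before restoring. A minimal sketch of that filtering logic, using a plain dict of shape tuples to stand in for a real checkpoint (the actual surgery would go through TensorFlow's checkpoint-reading and saving APIs, and the variable names here are illustrative):

```python
def strip_incompatible(variables, names_to_drop):
    """Return a copy of a checkpoint's variable map without the given tensors."""
    return {name: value for name, value in variables.items()
            if name not in names_to_drop}

# Hypothetical variable map standing in for real checkpoint contents;
# shape tuples stand in for the tensors themselves.
checkpoint = {
    "bert/encoder/layer_0/kernel": (768, 768),
    "output_weights_cls": (2, 768),  # shaped for 2 classes, not 3
    "output_bias_cls": (2,),
}

pruned = strip_incompatible(checkpoint, {"output_weights_cls", "output_bias_cls"})
# The remaining variables can be restored as usual; the classifier head is then
# re-initialized with the new number of classes at fine-tuning time.
```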

Regarding the second question, from https://arxiv.org/abs/2004.02349 Section 3:

We create pre-training inputs by extracting text-table pairs from Wikipedia. We extract 6.2M tables: 3.3M of class Infobox and 2.9M of class WikiTable. We consider tables with at most 500 cells. All of the end task datasets we experiment with only contain horizontal tables with a header row with column names. Therefore, we only extract Wiki tables of this form, using the `<th>` tag to identify headers. We furthermore transpose Infoboxes into a table with a single header and a single data row. The tables created from Infoboxes are arguably not very typical, but we found them to improve performance on the end tasks. As a proxy for questions that appear in the end tasks, we extract the table caption, article title, article description, segment title and text of the segment the table occurs in as relevant text snippets. In this way we extract 21.3M snippets.
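The Infobox transposition described above (key-value pairs flattened into one header row and one data row) can be sketched as follows; the function name and input format are illustrative, not from the TAPAS codebase:

```python
def transpose_infobox(infobox):
    """Turn an Infobox's key-value pairs into a single-header, single-row table.

    `infobox` is a list of (attribute, value) pairs, e.g. as scraped from a
    Wikipedia Infobox. Returns (header_row, data_row).
    """
    header_row = [attribute for attribute, _ in infobox]
    data_row = [value for _, value in infobox]
    return header_row, data_row

# Illustrative example (not real extraction output).
header, row = transpose_infobox([
    ("Born", "1879"),
    ("Field", "Physics"),
])
# header == ["Born", "Field"], row == ["1879", "Physics"]
```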

The concrete code we used to map from a Wikipedia dump to the interactions in the way described above depends on an internal data format so it's not useful/possible to share at this time. We may revisit and look for a workaround in the future and we are also happy to answer other questions about how those interactions were generated.

jaihonikhil commented 3 years ago

Well, without knowing how the interactions are generated, will it be possible to pre-train on our own data, since the interactions file you provide would be different? Also, I could not find output_weights_cls in the checkpoint files I searched, which were inside the tapas_inter_masklm_large_reset zip. I am attaching an image of the files I searched. Can you let me know which checkpoint file contains output_weights_cls? [screenshot]

SyrineKrichene commented 3 years ago

Hi @jaihonikhil, you can now set reset_output_cls to true when calling run_task_main.py or tapas/tapas/experiments/tapas_classifier_experiment.py.
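For example, the flag could be passed like this (the other flags and paths are illustrative placeholders following run_task_main.py's usual interface; only reset_output_cls is the part confirmed above, so check the repo README for the exact flag set your task needs):

```shell
# Hypothetical invocation -- paths and task name are placeholders.
python run_task_main.py \
  --task="SQA" \
  --init_checkpoint="path/to/model.ckpt" \
  --output_dir="path/to/output" \
  --reset_output_cls=true \
  --mode="train"
```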
