google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

WTQ TableQuestionAnswering Task: answer_coordinates for weak supervision task #134

Open ManasiPat opened 3 years ago

ManasiPat commented 3 years ago

I am using the PyTorch Hugging Face model (https://huggingface.co/transformers/model_doc/tapas.html) for a table question answering task. My data has only the final answer for supervision, with no answer coordinates or aggregation labels. I have created the data as a TSV file in the SQA format, where I set the answer_text, answer_coordinates and aggregation_labels columns to None and put my answer in the float_answer column. I am getting the following error from TapasTokenizer:

Traceback (most recent call last):
  File "tapas_pytorch/test.py", line 76, in <module>
    for idx, batch in enumerate(train_dataloader):
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "tapas_pytorch/test.py", line 33, in __getitem__
    encoding = self.tokenizer(table=table,
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 617, in __call__
    return self.encode_plus(
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 966, in encode_plus
    return self._encode_plus(
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1020, in _encode_plus
    return self.prepare_for_model(
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1177, in prepare_for_model
    labels = self.get_answer_ids(column_ids, row_ids, table_data, answer_text, answer_coordinates)
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1754, in get_answer_ids
    return self._get_answer_ids(column_ids, row_ids, answer_coordinates_question)
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1740, in _get_answer_ids
    answer_ids, missing_count = self._get_all_answer_ids(column_ids, row_ids, answer_coordinates)
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1666, in _get_all_answer_ids
    column_ids, row_ids, answers_list=(_to_coordinates(answer_coordinates))
  File "/mnt/nfs/deep-learning-and-ai/deep-learning-under-data-sparsity/Manasi/tapas_pytorch/lib/python3.8/site-packages/transformers/models/tapas/tokenization_tapas.py", line 1663, in _to_coordinates
    return [(coords[1], coords[0]) for coords in answer_coordinates_question]
TypeError: 'numpy.float64' object is not iterable

Can answer_coordinates be None in the SQA format, especially when I am doing weak supervision?
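
The last frame of the traceback iterates over answer_coordinates, so a scalar such as a float NaN fails there. A likely cause (an assumption, since the data loading code is not shown) is that the empty TSV column was read as NaN rather than as a list. The sketch below reproduces the failure with a plain float and shows a hypothetical helper, parse_coordinates, that normalizes a raw cell into a list of tuples before it reaches the tokenizer. It does not establish whether the tokenizer or the loss accept an empty list for weak supervision.

```python
import ast
import math


def parse_coordinates(value):
    """Hypothetical helper: turn a raw TSV cell into a list of (row, column) tuples."""
    # Missing values can arrive as None, as a float NaN (numpy.float64 is a
    # subclass of Python's float), or as an empty or "None" string.
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    text = str(value).strip()
    if text in ("", "None"):
        return []
    coords = []
    for item in ast.literal_eval(text):
        # Accept both "[(0, 1)]" and "['(0, 1)']" style cells.
        if isinstance(item, str):
            item = ast.literal_eval(item)
        coords.append(tuple(item))
    return coords


# Reproduce the failure: the comprehension from the traceback iterates its argument.
missing = float("nan")
try:
    [(coords[1], coords[0]) for coords in missing]
    failed = False
except TypeError:
    failed = True

print(failed)
print(parse_coordinates(missing))
print(parse_coordinates("[(0, 1), (2, 1)]"))
```

With this kind of normalization the value is always a list, so the iteration in the traceback no longer raises a TypeError on a missing cell.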

ManasiPat commented 2 years ago

@eisenjulian @ghost Our dataset has no intermediate labels in the form of answer_coordinates, and we cannot compute them with the provided parsing utility, because the queries are aggregation queries and none of the cells match the answer. We are using the PyTorch Hugging Face TAPAS implementation. If we pass answer_coordinates as None, the code throws an error. What should be passed as answer_coordinates in this scenario? When we set the cell selection labels to all zeros (since we don't know them), the model does not train. In short, how can the model be trained when only weak supervision is available? Please answer, as we have been trying to figure this out for the past month.
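
To illustrate why coordinates cannot be derived for aggregation queries, here is a minimal sketch of exact-match cell lookup. The function find_matching_cells and the toy table are hypothetical and are not the TAPAS parsing utility; they only show that an answer produced by aggregation (a sum here) matches no cell, so the lookup returns an empty list.

```python
def find_matching_cells(table, answer):
    """Hypothetical helper: return (row, column) pairs whose cell text equals the answer."""
    target = str(answer).strip().lower()
    matches = []
    for row_index, row in enumerate(table):
        for col_index, cell in enumerate(row):
            if str(cell).strip().lower() == target:
                matches.append((row_index, col_index))
    return matches


# Toy table (data rows only, header excluded); values are made up for illustration.
table = [["Alice", "3"], ["Bob", "4"]]

lookup_coords = find_matching_cells(table, "4")  # answer is a cell value
sum_coords = find_matching_cells(table, "7")     # answer is 3 + 4, present in no cell

print(lookup_coords)
print(sum_coords)
```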