Hi Niels! - Githubissues

ManasiPat commented 3 years ago

Hi Niels!

In the case of WTQ we have some special logic that tries to find the answer text in the table or that populates the float_value field if the answer is a real number.

The logic is here:

https://github.com/google-research/tapas/blob/master/tapas/utils/interaction_utils_parser.py

For WTQ parse_question will be called with mode REMOVE_ALL. (The same code is used for WikiSQL where we have the supervised mode that uses the coordinates extraction from the SQL and the weakly-supervised mode where we do the same as for WTQ.)

_parse_answer_coordinates is the code that searches for the answer. _parse_answer_float will try to parse the text as a float.

_parse_answer_coordinates uses this linear optimization code but is actually just extracting the first text match of the answer in the table.
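
For readers following along, the behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in interaction_utils_parser.py: the helper names `find_first_match` and `try_parse_float`, the case-insensitive comparison, and the comma stripping are all invented for this example.

```python
from typing import List, Optional, Tuple


def find_first_match(table: List[List[str]], answer: str) -> Optional[Tuple[int, int]]:
    """Return (row, column) of the first cell whose text equals the answer.

    Hypothetical helper: the comparison ignores case and surrounding spaces.
    """
    target = answer.strip().lower()
    for row_index, row in enumerate(table):
        for column_index, cell in enumerate(row):
            if cell.strip().lower() == target:
                return (row_index, column_index)
    return None


def try_parse_float(text: str) -> Optional[float]:
    """Return the text as a float, or None if float() cannot parse it."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


table = [["Netherlands", "17.4"], ["Belgium", "11.5"]]

# An answer that appears in the table yields coordinates.
coordinates = find_first_match(table, "belgium")

# An aggregation result usually matches no cell, so only the float is available.
missing = find_first_match(table, "28.9")
float_value = try_parse_float("28.9")
```

The second case is the one discussed later in this thread: the answer parses as a number but there is no cell to point at.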

Originally posted by @ghost in https://github.com/google-research/tapas/issues/50#issuecomment-705465960

ManasiPat commented 3 years ago

@NielsRogge @ghost Can you confirm whether, for the PyTorch implementation of TAPAS, we need to feed the data in SQA format? What should go in the answer coordinates field when the query is an aggregation query whose answer does not match any cell value in the table?

ManasiPat commented 3 years ago

@NielsRogge @eisenjulian @ghost Our dataset has no intermediate labels in the form of answer_coordinates, and we cannot compute them with the provided parsing utility because the queries are aggregation queries and none of the cells would match the answer. We are using the PyTorch Hugging Face TAPAS implementation. If we pass answer_coordinates as None, the code throws an error. What should be passed as answer_coordinates in such a scenario? When we set the cell selection labels to all zeros (since we don't know them), the model does not train. In short, how do we train the model with weak supervision only? Please answer this, as we have been trying to figure it out for the past month.

SyrineKrichene commented 2 years ago

Hi,

Our model is built to first select relevant cells (scored higher than a threshold) then to apply aggregation over the selected cells. In this case you need to provide the coordinates of all cells to aggregate over. With this logic the model is not capable of finding the result of aggregation without computing the aggregation over specified cells.
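
To make the select-then-aggregate logic concrete, here is a minimal sketch in plain Python. It is not the TAPAS implementation: the names `select_cells` and `aggregate`, the 0.5 threshold, and the example probabilities are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Coordinate = Tuple[int, int]  # (row, column)


def select_cells(probabilities: Dict[Coordinate, float],
                 threshold: float = 0.5) -> List[Coordinate]:
    """Keep the cells whose selection score is above the threshold (assumed 0.5)."""
    return sorted(cell for cell, score in probabilities.items() if score > threshold)


def aggregate(values: List[float], function: str) -> float:
    """Apply SUM, AVERAGE or COUNT to the values of the selected cells."""
    if function == "COUNT":
        return float(len(values))
    if function == "SUM":
        return sum(values)
    if function == "AVERAGE":
        if not values:
            raise ValueError("AVERAGE needs at least one selected cell")
        return sum(values) / len(values)
    raise ValueError(f"unsupported aggregation function: {function}")


# Made-up numeric cell values and selection scores for one column.
cell_values = {(0, 1): 10.0, (1, 1): 20.0, (2, 1): 5.0}
cell_scores = {(0, 1): 0.9, (1, 1): 0.8, (2, 1): 0.1}

selected = select_cells(cell_scores)
total = aggregate([cell_values[cell] for cell in selected], "SUM")
```

The aggregation only ever sees the selected cells, which is why an answer cannot be produced without some cells to aggregate over.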

Here are the fields in the interaction proto to fill:

    // Coordinates of cells that contain the answers. (Please put all the cells over which to aggregate)
    repeated AnswerCoordinate answer_coordinates = 1;

    // A function that is applied to the answer cells in order to obtain the final answer. (Please choose between NONE/ SUM/ AVERAGE/ COUNT)
    optional AggregationFunction aggregation_function = 2;

    // Present if the answer can be represented as a single float value, for example produced by an aggregation ('the average population of all countries'). (Please put your aggregation result)
    optional float float_value = 4;
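
As an illustration of how these three fields relate to each other, the sketch below builds them for one aggregation question when you already know which rows and column the question aggregates over. It uses a plain dictionary rather than the real proto classes, and `build_aggregation_labels` is a hypothetical helper, not part of TAPAS.

```python
from typing import Any, Dict, List


def build_aggregation_labels(rows: List[List[str]],
                             column_index: int,
                             row_indices: List[int],
                             function: str) -> Dict[str, Any]:
    """Fill answer_coordinates, aggregation_function and float_value.

    Hypothetical helper: `rows` holds the table body without the header row.
    """
    coordinates = [(row_index, column_index) for row_index in row_indices]
    if function == "COUNT":
        # COUNT does not need the cells to be numeric.
        float_value = float(len(coordinates))
    elif function in ("SUM", "AVERAGE"):
        values = [float(rows[r][column_index]) for r in row_indices]
        float_value = sum(values)
        if function == "AVERAGE":
            float_value /= len(values)
    else:
        raise ValueError(f"unsupported aggregation function: {function}")
    return {
        "answer_coordinates": coordinates,
        "aggregation_function": function,
        "float_value": float_value,
    }


# Made-up table: country and population in millions.
rows = [["France", "67"], ["Spain", "47"], ["Italy", "60"]]

# "What is the average population of France and Italy?"
labels = build_aggregation_labels(rows, column_index=1, row_indices=[0, 2],
                                  function="AVERAGE")
```

Here the coordinates list every cell that goes into the aggregation, and float_value holds the aggregation result rather than the text of any single cell.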

If the supervision is removed, then the aggregation step should also be removed, so nothing would ensure that the model understands the connection between the aggregation operation and the cell tokens. You would be expecting a BERT-based model over text to act like a calculator. BERT-based models (I'm talking about non-supervised tasks here) are usually not very good at calculation tasks without special tokens or additional methods. There is a lot of research on improving BERT's performance as a calculator, but you can still try it. For that, you need to change the TAPAS model architecture:

Easy to try: if you can represent the final answer with a single token, you can use a classifier model to predict a class (with the number of classes given by your vocab size).
More complex: change the architecture to enable sequential predictions (not implemented yet)

Thanks, Syrine



google-research / tapas

Hi Niels! #138