google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Table length #61

Closed bpquestion closed 4 years ago

bpquestion commented 4 years ago

I'm wondering how to work with long tables (sequence length over 512) in TAPAS. Also, is it possible to pass more than one table as input for prediction, have the model find the right table, and then make predictions? Thanks!

NielsRogge commented 4 years ago

As stated in the paper (limitations section):

"TAPAS handles single tables as context, which are able to fit in memory. Thus, our model would fail to capture very large tables, or databases that contain multiple tables. In this case, the table(s) could be compressed or filtered, such that only relevant content would be encoded, which we leave for future work."

So one idea is to filter the table down to its relevant rows before feeding it to TAPAS. This is what the TaBERT paper by Facebook AI does (TaBERT is a model similar to TAPAS). It first creates a "content snapshot" that contains only the rows of the table most relevant to the question. As quoted from their paper:

"We use a simple strategy to create content snapshots of K rows based on the relevance between the utterance and a row. For K > 1, we select the top-K rows in the input table that have the highest n-gram overlap ratio with the utterance. For K = 1, to include in the snapshot as much information relevant to the utterance as possible, we create a synthetic row by selecting the cell values from each column that have the highest n-gram overlap with the utterance."

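A minimal Python sketch of the row selection described in that quote, for illustration only. It is not TaBERT's actual code: the helper names (`ngrams`, `overlap_ratio`, `content_snapshot`), the regex word tokenization, the maximum n-gram length of 3, and the definition of the ratio as the share of the question's n-grams found in a cell or row are all assumptions. The table is assumed to be a list of dicts mapping column name to cell text.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]  # assumed table format: column name -> cell text


def ngrams(text: str, max_n: int = 3) -> Set[Tuple[str, ...]]:
    """All word n-grams of length 1..max_n, lowercased, punctuation dropped."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def overlap_ratio(question_ngrams: Set[Tuple[str, ...]],
                  cells: Iterable[str], max_n: int = 3) -> float:
    """Share of the question's n-grams that occur in any of the given cells.

    This definition of the ratio is an assumption; the quote does not
    spell it out.
    """
    if not question_ngrams:
        return 0.0
    found = set().union(*(ngrams(cell, max_n) for cell in cells))
    return len(question_ngrams & found) / len(question_ngrams)


def content_snapshot(question: str, rows: List[Row], k: int,
                     max_n: int = 3) -> List[Row]:
    """Return a K-row snapshot of the table for the given question."""
    if not rows or k < 1:
        return []
    q = ngrams(question, max_n)
    if k == 1:
        # K = 1: build one synthetic row, taking for each column the cell
        # with the highest overlap (ties go to the earliest row).
        return [{
            col: max((row[col] for row in rows),
                     key=lambda cell: overlap_ratio(q, [cell], max_n))
            for col in rows[0]
        }]
    # K > 1: rank row indices by overlap, keep the top K, and restore
    # the original table order.
    ranked = sorted(range(len(rows)),
                    key=lambda i: overlap_ratio(q, rows[i].values(), max_n),
                    reverse=True)
    return [rows[i] for i in sorted(ranked[:k])]
```

Calling `content_snapshot(question, rows, k=3)` would return at most three rows to pass on to the model. Note that with `k=1` the synthetic row can combine cells from different rows, so it no longer corresponds to a real record in the table.
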
ghost commented 4 years ago

Thanks for the answer!

Yes, filtering with some heuristic can help to reduce the input length.
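
As a rough illustration of how such a heuristic could keep the input under the length limit, here is a sketch that keeps rows (assumed to be already sorted by relevance) until an approximate token budget is used up. The function name `fit_rows_to_budget` is hypothetical, and counting whitespace-separated words is only an assumption standing in for the real tokenizer: subword tokenization usually produces more tokens, so a budget well below 512 would be safer in practice.

```python
from typing import Dict, List

Row = Dict[str, str]  # assumed table format: column name -> cell text


def fit_rows_to_budget(question: str, ranked_rows: List[Row],
                       max_tokens: int = 512) -> List[Row]:
    """Keep the leading rows that fit in an approximate token budget.

    ranked_rows is assumed to be sorted from most to least relevant.
    Token counts are approximated by whitespace splitting (an assumption,
    not the model's real tokenizer).
    """
    if not ranked_rows:
        return []
    # The question and the column headers are always part of the input.
    used = len(question.split())
    used += sum(len(col.split()) for col in ranked_rows[0])
    kept: List[Row] = []
    for row in ranked_rows:
        cost = sum(len(cell.split()) for cell in row.values())
        if used + cost > max_tokens:
            break  # stop at the first row that would exceed the budget
        kept.append(row)
        used += cost
    return kept
```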

We do not support multiple tables on the input side at this point.

bpquestion commented 4 years ago

Thanks NielsRogge for your answer. I read the TaBERT paper, but their GitHub repository does not give any working example (notebook or code) for prediction or for the selection of relevant rows (the snapshot). Can you please show us the code related to this part? Thanks in advance.

NielsRogge commented 4 years ago

Hmm, I'm not sure if I can help. Maybe you can take a look at this notebook, in which they show how to compute n-grams and similarity.

In what format do the tables come?

eisenjulian commented 4 years ago

Will close this one since I think there's a good answer already. Thank you all, please re-open if needed.