google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Low accuracy on larger tables #119

Open · joshplasse opened this issue 3 years ago

joshplasse commented 3 years ago

Hi -- I am trying to fine-tune the WikiSQL base model to accurately make predictions for tables that have up to 150 rows, and I have noticed that prediction accuracy drops significantly on larger tables. I have updated the config file so that max_row_num=150 and increased max_length=1024. I am able to fine-tune the model successfully with these updated parameters, and the changes allow TAPAS to predict rows later in the table (e.g., row idx > 64). However, the model rarely makes correct predictions for queries whose answer coordinates appear past row 64. Are there other config parameters that need to be considered when making predictions on larger tables?
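For context, here is roughly how I am raising the limits when converting examples. This is a minimal sketch rather than my exact code; the field names follow my understanding of the `tf_example_utils` utilities in this repo, and the vocab path and exact values are just examples:

```python
# Rough sketch of the conversion config used for ~150-row tables.
# Field names follow tapas.utils.tf_example_utils as I understand them;
# the vocab path below is only an example.
from tapas.utils import tf_example_utils

config = tf_example_utils.ClassifierConversionConfig(
    vocab_file="tapas_model/vocab.txt",   # example path
    max_seq_length=1024,                  # raised from the default 512
    max_column_id=32,
    max_row_id=150,                       # raised from the default 64
    strip_column_names=False,
    add_aggregation_candidates=False,
)
converter = tf_example_utils.ToClassifierTensorflowExample(config)
```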

Further, I am aware that a transformer's computational complexity is quadratic in the length of the tokenized sequence (doubling the sequence length from 512 to 1024 roughly quadruples the attention cost). However, using GPUs I am able to fine-tune the model with the increased sequence length, which allows all of the training tables to be tokenized without truncation. Are there any resources that discuss a degradation in accuracy when making predictions on larger tables?

eisenjulian commented 3 years ago

Hi @joshplasse, can you share some information about your dataset? Is it public? How big is it? By altering the number of rows and the sequence length only for fine-tuning, some of the embeddings (positional and row_index) have to be trained from random initialization, which potentially contributes to the problems you are seeing. I recommend considering the following options:

  1. Make sure to use the checkpoints marked with reset, since re-starting the positional embeddings at every cell helps with generalizing to longer sequences.
  2. I know this might not be easy, but perhaps pre-training on a set of larger tables could help.
  3. Split the table into multiple chunks and then join the results using a heuristic (see the sketch after this list).
  4. Try heuristic ways of trimming the table content to restrict it to whatever is relevant. We tried something like this for https://arxiv.org/abs/2010.00571, mostly at the column level, but perhaps in your case the row level makes more sense.
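For option 3, here is a minimal sketch of what I mean; this is not code from this repo. `predict` stands in for whatever prediction call you already use, and the chunk size and the max-score heuristic are just placeholders:

```python
# Sketch: split a table into row chunks that each fit the model, run the
# usual prediction on every chunk, and keep the answer from the chunk the
# model is most confident about, with row indices remapped to the full table.
import pandas as pd

def predict_in_chunks(table: pd.DataFrame, query: str, predict, chunk_rows: int = 50):
    """`predict(chunk, query)` should return (coordinates, score), where
    coordinates is a list of (row, column) pairs within the chunk."""
    best = None
    for start in range(0, len(table), chunk_rows):
        chunk = table.iloc[start:start + chunk_rows].reset_index(drop=True)
        coordinates, score = predict(chunk, query)
        # Shift row indices back to their position in the full table.
        remapped = [(row + start, col) for row, col in coordinates]
        if best is None or score > best[1]:
            best = (remapped, score)
    return best
```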

Finally, we have some other work on the way addressing the problem of larger tables, so we would love to see public datasets where this problem is evident. One such paper and code release should be coming up in the next few weeks.