google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

"Can't convert interaction: error: Too many rows" and "Can't convert interaction: Sequence too long" #14

Closed sbhttchryy closed 4 years ago

sbhttchryy commented 4 years ago

A big thank you to all the developers for this wonderful concept and for open-sourcing the implementation. I was trying to perform prediction (using the SQA base model) on a table of size 1,801,330 × 59.

When I reduce the number of columns (to 3 in my case) and use the total number of rows, it returns a "Can't convert interaction: error: Too many rows" error. It returns this error three times, without printing any interaction.id, which leads me to believe it happens only for particular rows. Please correct me if I am mistaken.

When I reduce the number of rows and use all 59 columns, it returns "Can't convert interaction: Sequence too long", again without any interaction id.

I am performing my tests in Colab with 12.72 GB of RAM and 68.40 GB of disk space. The table contains a mix of dates, integers, floats and characters.

I would appreciate any input for a workaround. Thank you.

vitallalaji commented 4 years ago

I'm also facing the same issue

muelletm commented 4 years ago

Thanks for opening this issue.

The way TAPAS works, it needs to flatten the table into a sequence of word pieces. This sequence needs to fit into the specified maximum sequence length (the default is 512). TAPAS has a pruning mechanism that will try to drop tokens, but it will never drop cells. Therefore, at a sequence length of 512 there is no way to fit a table with more than 512 cells.

If you really want to run the model on 1.8M rows, I would suggest that you split your data row-wise. For your table, for example, you would need blocks of at most ~8 rows.
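A minimal sketch of that splitting step (assuming the table is loaded as a pandas DataFrame; the 32-token allowance for the question and the one-word-piece-per-cell estimate are rough assumptions, so real tables may need smaller blocks):

    import pandas as pd

    MAX_SEQ_LENGTH = 512  # default TAPAS sequence length


    def split_into_blocks(table, max_seq_length=MAX_SEQ_LENGTH):
      """Yields row blocks that should fit into the sequence length.

      Assumes roughly one word piece per cell plus a budget for the question
      and the header row; real tables will often need smaller blocks.
      """
      num_columns = len(table.columns)
      budget = max_seq_length - 32 - num_columns  # question + header (assumption)
      rows_per_block = max(1, budget // num_columns)
      for start in range(0, len(table), rows_per_block):
        yield table.iloc[start:start + rows_per_block]


    # Usage: convert each block to the interaction / TSV format expected by
    # the prediction script and aggregate the per-block answers yourself.
    # for block in split_into_blocks(big_table):
    #   ...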

Alternatively, you can increase the sequence length, but that will also increase the cost of running the model.

I hope that helps.

sbhttchryy commented 4 years ago

Hi @muelletm, thank you for your response. I have tried reducing the number of rows. However, it still does not work with 53 columns. The maximum number of columns for which it works is 20. I also increased the sequence length to 1024, but that still does not accommodate 50+ columns. It does work when the sequence length is set to 2048; however, in that case it just returns the table and not the answers to the queries.

vitallalaji commented 4 years ago

I already tried that, i.e. flattening the table with a maximum sequence length of 512. I am still facing the same issue.

vitallalaji commented 4 years ago

Do you mean that if the data has more than 512 cells it will not return the answer? Then how should I approach large datasets? Splitting row-wise doesn't give the exact answer, because the answer may be present in other rows. To get the exact answer I would have to load each block and check it for the answer, which would take a lot of time, wouldn't it? How can we handle this kind of situation?

vitallalaji commented 4 years ago

Hi @muelletm, can I re-train the model if my data has more than 512 cells? Are there any parameters we can change to fit more than 512 cells without re-training? And did you try training the model with more than 512 cells, and if so, how were the results (performance-wise)? Thanks in advance.

muelletm commented 4 years ago

There is exactly one parameter of the model that depends on the sequence length: the positional embeddings. They are currently set to a maximum of 512 tokens.

To change that, you have to a) change the BERT config file and b) change the corresponding matrix in the checkpoint.

  "max_position_embeddings": 512,

Without re-training I don't really see a good way of updating that table.

One thing you could try is to shift the embeddings, but I am not sure that will work. So assuming that the first 32 tokens are special because they usually contain the question, you could try to create a new table by just reusing some of the embeddings:

[e_0 ... e_31, e_32 ... e_511, e_32 ... e_511, e_32 ... e_511]
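A rough numpy sketch of that idea (the variable name bert/embeddings/position_embeddings, the 32-token split and the whole approach are assumptions and untested; check the actual name with tf.train.list_variables):

    import numpy as np
    import tensorflow.compat.v1 as tf

    # Positional embedding variable; verify the exact name against your
    # checkpoint with tf.train.list_variables(checkpoint_path).
    POS_EMB_NAME = "bert/embeddings/position_embeddings"


    def tiled_position_embeddings(checkpoint_path, new_length=1024, num_special=32):
      """Builds a [new_length, hidden_size] matrix by reusing old embeddings.

      Keeps the first `num_special` embeddings (which usually cover the
      question) and repeats the remaining ones to fill the longer sequence.
      """
      reader = tf.train.load_checkpoint(checkpoint_path)
      old = reader.get_tensor(POS_EMB_NAME)  # shape [512, hidden_size]
      head, body = old[:num_special], old[num_special:]
      repeats = -(-(new_length - num_special) // len(body))  # ceil division
      tiled = np.concatenate([head] + [body] * repeats, axis=0)
      return tiled[:new_length]

The resulting matrix would still have to be written back into a new checkpoint (for example by assigning it to the variable in a rebuilt graph and re-saving), and max_position_embeddings in the BERT config raised to the new length.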

If you can fine-tune you could just not load that variable when you instantiate your model. You can see in tapas_pretraining_model.py how to discard the embeddings:

    tvars = tf.trainable_variables()

    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
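      # Exclude the positional embeddings from the assignment map so they are
      # not restored from the checkpoint.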
      init_tvars = [
          tvar for tvar in tvars if "position_embeddings" not in tvar.name
      ]
      (assignment_map, initialized_variable_names
      ) = modeling.get_assignment_map_from_checkpoint(init_tvars,
                                                      init_checkpoint)
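With position_embeddings left out of the assignment map, that variable is simply not restored from the checkpoint: it keeps its fresh initialization at the new max_position_embeddings size and gets learned during fine-tuning.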

We did try training with longer sequences. I think we had to reduce the batch size to fit the model. On the datasets we evaluated (WTQ, SQA, WikiSQL) it didn't really give a quality improvement.

muelletm commented 4 years ago

Closing for now.