I'm also facing the same issue.
Thanks for opening this issue.
The way TAPAS works, it needs to flatten the table into a sequence of word pieces. This sequence needs to fit into the specified maximum sequence length (the default is 512). TAPAS has a pruning mechanism that will try to drop tokens, but it will never drop cells. Therefore, at a sequence length of 512 there is no way to fit a table with more than 512 cells.
If you really want to run the model on 1.8M rows, I would suggest that you split your data row-wise. For your table, for example, you would need blocks with a maximum of ~8 rows.
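For example, a minimal sketch of that row-wise splitting (not part of the TAPAS codebase; it assumes the table is loaded as a pandas DataFrame, and the names are illustrative only):
import pandas as pd

def split_into_blocks(table: pd.DataFrame, max_cells: int = 512):
    # Yield consecutive row blocks whose total cell count stays within max_cells.
    rows_per_block = max(1, max_cells // len(table.columns))
    for start in range(0, len(table), rows_per_block):
        yield table.iloc[start:start + rows_per_block]

# For a 59-column table this yields blocks of 512 // 59 = 8 rows,
# matching the "~8 rows" estimate above.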
Alternatively, you can increase the sequence length, but that will also increase the cost of running the model.
I hope that helps.
Hi @muelletm, thank you for your response. I have tried reducing the number of rows, but it still does not work for 53 columns; the maximum number of columns for which it works is 20. I also increased the sequence length to 1024, but it still cannot accommodate 50+ columns. It does work when the sequence length is set to 2048, but in that case it just returns the table and not the answers to the queries.
I already tried it that way, i.e. flattening the table with a maximum sequence length of 512, but I am still facing the same issue.
If the data has more than 512 cells it will not return the answer, is that what you mean? Then how can I approach large datasets? Splitting row-wise doesn't give the exact answer, because the answer may be present in other rows. If I need the exact answer every time, I would have to load the data and check each block for the answer, which would take a lot of time, wouldn't it? How can we handle this kind of situation?
Hi @muelletm, can I re-train the model if my data has more than 512 cells? Are there any parameters that we can change to fit more than 512 cells without re-training? And did you ever train the model with more than 512 cells, and if so, how were the results (performance-wise)? Thanks in advance.
There is exactly one parameter of the model that depends on the sequence length: the positional embeddings. It is currently set to a maximum of 512 tokens.
To change that you have to (a) change the BERT config file and (b) change the corresponding matrix in the checkpoint:
"max_position_embeddings": 512,
Without re-training I don't really see a good way of updating that embedding matrix.
One thing you could try is to shift the embeddings, but I am not sure that will work. Assuming that the first 32 tokens are special because they usually contain the question, you could try to create a new embedding table by reusing some of the existing embeddings:
[e_0 ... e_31 e_32 ... e_511 e_32 ... e_511 e_32 ... e_511]
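A minimal sketch of that tiling idea (just an illustration, not code from the repository; it assumes the position-embedding matrix has already been extracted from the checkpoint into a numpy array, and the function name is made up):
import numpy as np

def extend_position_embeddings(emb, new_length, num_special=32):
    # emb has shape [512, hidden_size]; rows 0..31 correspond to e_0 ... e_31.
    special = emb[:num_special]                  # e_0 ... e_31 (question tokens)
    body = emb[num_special:]                     # e_32 ... e_511
    repeats = -(-(new_length - num_special) // len(body))  # ceiling division
    tiled = np.tile(body, (repeats, 1))[:new_length - num_special]
    return np.concatenate([special, tiled], axis=0)

# e.g. going from 512 to 1024 positions (remember to also update
# "max_position_embeddings" in the config):
# new_emb = extend_position_embeddings(old_emb, 1024)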
If you can fine-tune, you could simply not load that variable when you instantiate your model.
You can see in tapas_pretraining_model.py how to discard the embeddings:
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
  # Skip the position embeddings so they are freshly initialized
  # rather than restored from the 512-position checkpoint.
  init_tvars = [
      tvar for tvar in tvars if "position_embeddings" not in tvar.name
  ]
  (assignment_map, initialized_variable_names
  ) = modeling.get_assignment_map_from_checkpoint(init_tvars,
                                                  init_checkpoint)
We did try training with longer sequences. I think we had to reduce the batch size to fit the model. On the datasets we evaluated (WTQ, SQA, WikiSQL) it didn't really give a quality improvement.
Closing for now.
A big thank you to all the developers for this wonderful concept and for making the implementation open source. I was trying to perform prediction (using the SQA base model) on a table of size 1801330 x 59.
When I try reducing the number of columns (to 3 in my case) and using the total number of rows, it returns a "Can't convert interaction: error: Too many rows" error. It returns this error thrice, without printing any interaction.id, which leads me to believe it happens only for particular rows. Please correct me if I am mistaken.
When I reduce the number of rows and use all 59 columns, it returns "Can't convert interaction: error: Too many rows", without any interaction id.
I am performing my tests in Colab with 12.72 GB of RAM and 68.40 GB of disk space. The table contains a mix of dates, integers, floats, and characters.
I would appreciate any input for a workaround. Thank you.