google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

When I parse a table, all row_ids and column_ids are 0 #74

Closed: lairikeqiA closed this issue 4 years ago

lairikeqiA commented 4 years ago

When I parse a table (max_row <= 64, max_column <= 32) from the WTQ dataset, the cell at position (0,0) has 512 tokens and all row_ids and column_ids are 0, even though the question length is less than 512. I believe the cell at position (0,0) in your code represents the question context, and I have made sure the question length is less than 512. Can you explain this behavior?

ghost commented 4 years ago

I think we need more details to debug this.

Can you share the interaction proto in text format (you can just 'print()' the interaction)?

It would also be good to know how you are calling the conversion code and what the output example looks like.
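For reference, reading and printing interactions could look roughly like this (a sketch only, assuming the interactions are stored in a TFRecord file; "interactions.tfrecord" is a placeholder path, not something from this thread):

import tensorflow.compat.v1 as tf
from tapas.protos import interaction_pb2

# Iterate over serialized Interaction protos and print each one in
# human-readable protobuf text format.
for record in tf.python_io.tf_record_iterator("interactions.tfrecord"):
    interaction = interaction_pb2.Interaction()
    interaction.ParseFromString(record)
    print(interaction)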

In principle, if a cell has 512 tokens, we will try to trim it so that it fits into the 512-token input sequence.

lairikeqiA commented 4 years ago

The following example is from another dataset. The table is as follows (32 rows and 14 columns):

"Table 941. New Manufactured (Mobile) Homes Placed for Residential Use and Average","","","","","","","","","","","",""
"Sales Price by Region","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"See Notes","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","Units placed","","","","","Average sales price","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"Year","","","","","","","","","","","",""
"","U.S.","Northeast","Midwest","South","West","U.S.","Northeast","Midwest","South","West","",""
"Unit indicator","(1,000)","(1,000)","(1,000)","(1,000)","(1,000)","(dollars)","(dollars)","(dollars)","(dollars)","(dollars)","",""
"1980","233.7","12.3","32.3","140.3","48.7","19800.0","18500.0","18600.0","18200.0","25400.0","",""
"1985","283.4","20.2","38.6","187.6","36.9","21800.0","22700.0","21500.0","20400.0","28700.0","",""
"1990","195.4","18.8","37.7","108.4","30.6","27800.0","30000.0","27000.0","24500.0","39300.0","",""
"1995","319.4","15.0","57.5","203.2","43.7","35300.0","35800.0","35700.0","33300.0","44100.0","",""
"1996","337.7","16.2","58.8","218.2","44.4","37200.0","37300.0","38000.0","35500.0","45000.0","",""
"1997","336.3","14.3","55.3","219.4","47.3","39800.0","41300.0","40300.0","38000.0","47300.0","",""
"1998","373.7","14.7","58.3","250.3","50.4","41600.0","42200.0","42400.0","40100.0","48400.0","",""
"1999","338.3","14.1","53.6","227.2","43.5","43300.0","44000.0","44400.0","41900.0","49600.0","",""
"2000","280.9","14.9","48.7","178.7","38.6","46400.0","47000.0","47900.0","44300.0","54100.0","",""
"2002.0","174.3","11.8","34.2","101.0","27.2","51300.0","53200.0","51700.0","48000.0","62600.0","",""
"2003.0","139.8","11.2","25.2","77.2","26.1","54900.0","57300.0","55100.0","50500.0","67700.0","",""
"2004.0","124.4","11.0","20.6","67.4","25.5","58200.0","60200.0","58800.0","52300.0","73200.0","",""
"2005.0","122.9","9.2","17.1","68.1","28.5","62600.0","67000.0","60600.0","55700.0","79900.0","",""
"2006.0","112.4","7.9","14.5","66.1","23.9","64300.0","65300.0","59100.0","58900.0","83400.0","",""
"2007.0","94.8","7.0","10.8","59.4","17.7","65400.0","66100.0","64900.0","59900.0","85500.0","",""
"2008.0","79.3","5.0","8.2","53.0","13.1","64900.0","68400.0","65700.0","59700.0","85100.0","",""
"","","","","","","","","","","","",""
"Source: U.S. Census Bureau, ""Manufactured Housing"".","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""

I set the question "family" for this table; it has only one word/token. When I parse the table, the cell at position (0,0) has 512 tokens and all row_ids and column_ids are 0. Looking forward to your reply.

ghost commented 4 years ago

I am not able to reproduce this. I tried adding this to the WTQ colab:

import csv
import io

# Parse the CSV table string into rows.
csv_input = io.StringIO(predict_text)
reader = csv.reader(csv_input)
rows = list(reader)

# Keep only the first 5 rows and re-serialize with "|" separators,
# the format the colab's conversion helper expects.
new_predict_text = "\n".join("|".join(row) for row in rows[:5])

# Convert the (table, questions) pair and inspect the first 10 ids.
examples = list(convert_interactions_to_examples([(new_predict_text, ["family"])]))
print(list(examples[0].features.feature["row_ids"].int64_list.value)[:10])
print(list(examples[0].features.feature["column_ids"].int64_list.value)[:10])

predict_text was set to the table string you pasted above. I had to add the slice ([:5]) to avoid a "sequence too long" error. This produces the following output:

[0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

So everything seems to be working as intended (WAI): the leading zero ids cover the [CLS] token and the question, and the table tokens that follow get non-zero ids.

lairikeqiA commented 4 years ago

I set the question "family" for the table; it has only one word/token. Why does the cell at position (0,0) have 512 tokens after parsing the table? What is the meaning of the "sequence too long" error? Which sequence is too long?

ghost commented 4 years ago

TAPAS will represent the input as a long sequence of the tokenized question and the entire tokenized table.

This sequence has to fit into max_seq_length, which is 512 in your case. If it doesn't, we try to trim some cells, that is, we keep only the first few tokens of each cell. However, if the length of the tokenized question plus the number of cells is larger than 512, even that will not work, since every cell needs at least one token. That is when you get a "sequence too long" error.
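A minimal sketch of that trimming logic (hypothetical names, and a simplification of the actual conversion code, which also accounts for special tokens such as [CLS] and [SEP]):

# Shrink a shared per-cell token budget until the serialized
# question + table fits into max_seq_length (a sketch, not the
# actual tapas implementation).
def find_cell_token_budget(num_question_tokens, cell_lengths,
                           max_seq_length=512):
    for budget in range(max(cell_lengths), 0, -1):
        total = num_question_tokens + sum(
            min(length, budget) for length in cell_lengths)
        if total <= max_seq_length:
            return budget
    # Even one token per cell does not fit: this is the
    # "sequence too long" case.
    return None

If this returns None, i.e. the tokenized question plus one token per cell already exceeds max_seq_length, conversion fails with the "sequence too long" error.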

ghost commented 4 years ago

Closing for now, feel free to reopen.