Closed xiaopi-ouo closed 3 years ago
cc @NielsRogge
It doesn't work because type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10],
which means the default maximum table size is 256*256. However, the tokenizer doesn't check this.
Strictly speaking, it's not a bug, but in this kind of task many tables can exceed the default size.
It might be good to add code to handle this case.
Ok I've investigated this a bit more. The problem here is that there's a column called "Area (acres, 1872)" in the table whose values cause the column_ranks
token types to exceed the vocab size of 256. Deleting this column resolves the issue. Also, the table that you provide is actually (way) too large for TAPAS. Note that it only has a max sequence length of 512 tokens, and that in the original paper, only tables having at most 500 cells were considered.
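As a rough guard against both limits mentioned above, one could check the table's cell count before encoding. This is a hypothetical helper using plain pandas, not a transformers API; the 500-cell limit is the one reported from the original TAPAS paper:

```python
import pandas as pd

MAX_CELLS = 500  # limit used for tables in the original TAPAS paper


def table_fits_tapas(table: pd.DataFrame, max_cells: int = MAX_CELLS) -> bool:
    """Return True if the table's cell count is within the TAPAS limit."""
    rows, cols = table.shape
    return rows * cols <= max_cells


small = pd.DataFrame({"a": range(10), "b": range(10)})
large = pd.DataFrame({f"c{i}": range(100) for i in range(6)})
print(table_fits_tapas(small))  # True  (20 cells)
print(table_fits_tapas(large))  # False (600 cells)
```

A check like this only catches oversized tables; it does not detect individual columns whose values overflow the column_ranks vocab.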
To correctly truncate the table, you have to initialize TapasTokenizer as follows:
from transformers import TapasTokenizer
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact", drop_rows_to_fit=True)
and then encode the table + query as follows:
inputs = tokenizer(table=table, queries=queries,
                   padding="max_length",
                   truncation=True, return_tensors="pt")
cc @LysandreJik
It's maybe a bit weird that people have to initialize TapasTokenizer with drop_rows_to_fit set to True and also set truncation to True when calling it.
Thank you for the reply. It's really helpful!
You're right @NielsRogge; should we switch drop_rows_to_fit to True by default, given that no model can handle it set to False when overflowing?
I guess it should only be set to True when calling the tokenizer with truncation set to True or to "drop_rows_to_fit". If the user does not specify truncation and the table is too large, then an error will be thrown, as shown here.
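The proposal above amounts to deriving drop_rows_to_fit from the truncation argument. A minimal sketch of that decision logic, as a hypothetical helper rather than the actual transformers implementation:

```python
def resolve_drop_rows_to_fit(truncation) -> bool:
    """Sketch of the proposed behavior: only drop rows when the user
    explicitly asked for truncation, either with True or with the
    'drop_rows_to_fit' truncation strategy."""
    return truncation is True or truncation == "drop_rows_to_fit"


print(resolve_drop_rows_to_fit(True))                # True
print(resolve_drop_rows_to_fit("drop_rows_to_fit"))  # True
print(resolve_drop_rows_to_fit(False))               # False
```

With this rule, a user who passes no truncation at all still gets an error for oversized tables, which preserves the current fail-loudly behavior.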
Fixed by #9507
Environment info
transformers version: 4.1.0.dev0
Who can help
May @LysandreJik help?
Information
Model I am using (Bert, XLNet ...): TAPAS
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
import json
import pandas as pd
from transformers import TapasModel, TapasTokenizer

data = json.loads(data)
model = TapasModel.from_pretrained("google/tapas-base-finetuned-tabfact")
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact")
table = pd.DataFrame(data['table_list'][1:], columns=data['table_list'][0]).astype(str)
queries = [data['sentence_annotations'][0]['final_sentence']]
inputs = tokenizer(table=table, queries=queries, padding="max_length",
                   return_tensors="pt", max_length=512, truncation=True)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
token_type_ids = inputs['token_type_ids']
x = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
IndexError Traceback (most recent call last)