huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`TypeError: unhashable type: 'list'` when using DataCollatorForWholeWordMask #8525

Closed · sikfeng closed this issue 3 years ago

sikfeng commented 3 years ago

Environment info

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

This is the code I am running:

from transformers import BertTokenizer, BertForMaskedLM, AdamW, BertConfig, get_linear_schedule_with_warmup, pipeline, DataCollatorForWholeWordMask, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)

sent = "The capital of France is Paris."

data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(
    sent,
    truncation=True,
    add_special_tokens=True,
    max_length=64,
    return_tensors='pt',
    return_special_tokens_mask=True,
)

masked = data_collator([encoded])

print(masked)

It gives me the following error:

Traceback (most recent call last):
  File "sadness.py", line 19, in <module>
    masked = data_collator([encoded])
  File "/home/csikfeng/.local/lib/python3.7/site-packages/transformers/data/data_collator.py", line 328, in __call__
    token = self.tokenizer._convert_id_to_token(id)
  File "/home/csikfeng/.local/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 241, in _convert_id_to_token
    return self.ids_to_tokens.get(index, self.unk_token)
TypeError: unhashable type: 'list'
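
My guess at what is going on (I have not dug into the library internals): because of return_tensors='pt', the tokenizer returns input_ids as a batched 2-D tensor, so when the collator walks over the batch, the id it hands to _convert_id_to_token is a whole row of ids rather than a single int, and ids_to_tokens.get then fails because a list cannot be used as a dictionary key. A standalone illustration of that failure mode:

# ids_to_tokens is just a plain dict keyed by integer token ids
ids_to_tokens = {101: "[CLS]", 102: "[SEP]"}

print(ids_to_tokens.get(101, "[UNK]"))          # works: "[CLS]"
print(ids_to_tokens.get([101, 102], "[UNK]"))   # TypeError: unhashable type: 'list'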

But if I instead use data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15), like this:

from transformers import BertTokenizer, BertForMaskedLM, AdamW, BertConfig, get_linear_schedule_with_warmup, pipeline, DataCollatorForWholeWordMask, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)

sent = "The capital of France is Paris."

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(
    sent,
    truncation=True,
    add_special_tokens=True,
    max_length=64,
    return_tensors='pt',
    return_special_tokens_mask=True,
)

masked = data_collator([encoded])

print(masked)

I do not get any errors:

{'input_ids': tensor([[[  101, 10105, 12185, 10108, 63184, 10112, 10124, 24289, 10107,   119,
            102]]]), 'token_type_ids': tensor([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]), 'attention_mask': tensor([[1]]), 'labels': tensor([[[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]]])}

Expected behavior

It should perform whole word masking without raising an error.

sgugger commented 3 years ago

Can reproduce, working on a fix right now.

sgugger commented 3 years ago

So, DataCollatorForWholeWordMask has a few design flaws (for instance, it only works for BERT) and fixing it properly is not directly doable: what it tries to do should really be done at the tokenization level. I will adapt the run_mlm_wwm example to stop using it, and we will probably deprecate it afterward.
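
To sketch the idea (a rough illustration, not the library's actual code): the whole-word grouping relies on BERT's WordPiece convention that continuation pieces start with "##", which is why the collator only works for BERT-style tokenizers:

def group_subwords(tokens):
    # Group WordPiece tokens into whole-word index groups using the BERT
    # convention that continuation pieces are prefixed with "##".
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)   # continuation of the previous word
        else:
            groups.append([i])     # start of a new word
    return groups

print(group_subwords(["The", "capital", "of", "France", "is", "Par", "##is", "."]))
# [[0], [1], [2], [3], [4], [5, 6], [7]]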

For your specific problem, however, there is a fix, which is to remove `return_tensors='pt'` from the tokenizer call.
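
For reference, here is the snippet above with that one change (a sketch of the suggested fix; the collator then receives plain Python lists of ids and does the batching and tensor conversion itself):

from transformers import BertTokenizer, DataCollatorForWholeWordMask

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)
data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

sent = "The capital of France is Paris."

# Same tokenizer call as in the report, minus return_tensors='pt',
# so input_ids stays a plain list of ints.
encoded = tokenizer(
    sent,
    truncation=True,
    add_special_tokens=True,
    max_length=64,
    return_special_tokens_mask=True,
)

masked = data_collator([encoded])
print(masked)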

sikfeng commented 3 years ago

This solves my problem, thanks!