huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

error with transformers 2.9.1 but not with 2.3.0, same code, why? #5817

Closed wenfeixiang1991 closed 4 years ago

wenfeixiang1991 commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Bert

Language I am using the model on (English, Chinese ...): Chinese

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

My problem is that returning encoded raises an error with transformers 2.9.1, but the same code worked fine with transformers 2.3.0. Which behavior is right? I'm very confused.

My code is as follows:

# -*- coding: utf-8 -*-

import torch
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
from functools import partial
import logging

logging.basicConfig(level=logging.INFO)

bert_path = "/Users/kiwi/Desktop/chinese_wwm_ext"
model = BertModel.from_pretrained(bert_path)
tokenizer = BertTokenizer.from_pretrained(bert_path)

def tok_collate(batch_data):
    batch_sentence = [x[0] for x in batch_data]
    encoded = tokenizer.batch_encode_plus(
        batch_sentence,
        add_special_tokens=True,
        return_tensors='pt',
        pad_to_max_length=True)

    # Returning the individual tensors instead produces the tuple shown below:
    # return encoded['input_ids'], encoded['token_type_ids'], encoded['attention_mask']

    # (tensor([[ 101,  704, 1066,  704, 1925, 2600,  741, 6381,  510, 1744, 2157,  712,
    #          2375,  510,  704, 1925, 1092, 1999,  712, 2375,  739, 6818, 2398, 8108,
    #          3189,  683, 7305, 6626, 3959, 1266, 4689, 3636, 3727, 2356, 5440, 2175,
    #          4554, 2658, 7344, 2971, 2339,  868,  102]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    #          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    #          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

    # Returning the whole encoded object raises the following error with transformers 2.9.1:
    return encoded
    #   File "/Users/kiwi/anaconda/python.app/Contents/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 203, in __getattr__
    #     return self.data[item]
    #   KeyError: '__getstate__'

def data_loader(data):
    # num_workers=2 runs tok_collate in worker processes, so its return value
    # has to be pickled to be sent back to the main process
    dl = DataLoader(data, batch_size=8, shuffle=False, collate_fn=partial(tok_collate),
                    num_workers=2)

    for batch_data in dl:
        print(batch_data)

data = [('中共中央总书记、国家主席、中央军委主席习近平10日专门赴湖北省武汉市考察疫情防控工作', 1)]
data_loader(data)
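
A minimal sketch of what appears to be going on: with num_workers=2 the batch returned by tok_collate has to be pickled so it can travel from the worker process back to the main process, so the DataLoader itself is likely only the trigger. Reusing the same bert_path and sentence as above, the following should raise the same KeyError on 2.9.1 without any DataLoader involved (on 2.3.0, where batch_encode_plus returns a plain dict, it succeeds):

import pickle
from transformers import BertTokenizer

bert_path = "/Users/kiwi/Desktop/chinese_wwm_ext"
tokenizer = BertTokenizer.from_pretrained(bert_path)

encoded = tokenizer.batch_encode_plus(
    ['中共中央总书记、国家主席、中央军委主席习近平10日专门赴湖北省武汉市考察疫情防控工作'],
    add_special_tokens=True,
    return_tensors='pt',
    pad_to_max_length=True)

# On 2.9.1 batch_encode_plus returns a BatchEncoding whose __getattr__ raises
# KeyError (instead of AttributeError) when pickle looks up __getstate__,
# which should reproduce the KeyError: '__getstate__' from the traceback above.
pickle.dumps(encoded)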

Expected behavior

encoded = {'input_ids': tensor([[ 101,  704, 1066,  704, 1925, 2600,  741, 6381,  510, 1744, 2157,  712,
         2375,  510,  704, 1925, 1092, 1999,  712, 2375,  739, 6818, 2398, 8108,
         3189,  683, 7305, 6626, 3959, 1266, 4689, 3636, 3727, 2356, 5440, 2175,
         4554, 2658, 7344, 2971, 2339,  868,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Environment info

LysandreJik commented 4 years ago

Hi! Indeed, I can reproduce this. It seems to be an edge case we were not testing for in the later v2.x versions. It does run correctly on v2.3.0, as you've shown, and on recent versions (v3+) as well.

After looking a bit deeper into it, the problem seems to have been introduced together with BatchEncoding in version v2.9.0 and was patched in v3.0.0. The versions to avoid when using a parallelization mechanism (here the DataLoader) with batch_encode_plus are therefore those between v2.9.0 and v3.0.0, i.e. v2.9.x, v2.10.x and v2.11.x.
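
If you can't upgrade right away, one possible workaround on those versions is to convert the BatchEncoding back into a plain dict before it leaves the collate function, so the workers only have to pickle regular tensors. Something along these lines should work:

def tok_collate(batch_data):
    batch_sentence = [x[0] for x in batch_data]
    encoded = tokenizer.batch_encode_plus(
        batch_sentence,
        add_special_tokens=True,
        return_tensors='pt',
        pad_to_max_length=True)
    # Copy the key -> tensor mapping into a plain dict; the BatchEncoding
    # wrapper is what trips up pickling on these versions, while the tensors
    # themselves pickle fine.
    return dict(encoded)

Setting num_workers=0 should also sidestep the problem, since the batch then never crosses a process boundary.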

Hope this helps, and sorry for the inconvenience.