huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

error with transformers 2.9.1 but not with 2.3.0, same code, why? #5817

Closed wenfeixiang1991 closed 4 years ago

wenfeixiang1991 commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Bert

Language I am using the model on (English, Chinese ...): Chinese

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

My problem is that returning encoded raises an error with transformers 2.9.1, but the same code worked fine with transformers 2.3.0. Which behavior is right? I'm very confused.

My code is as follows:

# -*- coding: utf-8 -*-

import torch
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
from functools import partial
import logging

logging.basicConfig(level=logging.INFO)

bert_path = "/Users/kiwi/Desktop/chinese_wwm_ext"
model = BertModel.from_pretrained(bert_path)
tokenizer = BertTokenizer.from_pretrained(bert_path)

def tok_collate(batch_data):
    batch_sentence = [x[0] for x in batch_data]
    encoded = tokenizer.batch_encode_plus(
        batch_sentence,
        add_special_tokens=True,
        return_tensors='pt',
        pad_to_max_length=True)

    # Returning the individual tensors instead produces the tuple shown below:
    # return encoded['input_ids'], encoded['token_type_ids'], encoded['attention_mask']

    # (tensor([[ 101,  704, 1066,  704, 1925, 2600,  741, 6381,  510, 1744, 2157,  712,
    #          2375,  510,  704, 1925, 1092, 1999,  712, 2375,  739, 6818, 2398, 8108,
    #          3189,  683, 7305, 6626, 3959, 1266, 4689, 3636, 3727, 2356, 5440, 2175,
    #          4554, 2658, 7344, 2971, 2339,  868,  102]]), tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    #          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    #          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

    # Returning the whole encoded object raises the following error with transformers 2.9.1:
    return encoded
    #   File "/Users/kiwi/anaconda/python.app/Contents/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 203, in __getattr__
    #     return self.data[item]
    #   KeyError: '__getstate__'

def data_loader(data):
    # num_workers=2 runs tok_collate in worker processes, so its return value
    # has to be pickled to be sent back to the main process
    dl = DataLoader(data, batch_size=8, shuffle=False, collate_fn=partial(tok_collate),
                    num_workers=2)

    for batch_data in dl:
        print(batch_data)

data = [('中共中央总书记、国家主席、中央军委主席习近平10日专门赴湖北省武汉市考察疫情防控工作', 1)]
data_loader(data)
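
A minimal sketch of what appears to be going on: with num_workers=2 the batch returned by tok_collate has to be pickled so it can travel from the worker process back to the main process, so the DataLoader itself is likely only the trigger. Reusing the same bert_path and sentence as above, the following should raise the same KeyError on 2.9.1 without any DataLoader involved (on 2.3.0, where batch_encode_plus returns a plain dict, it succeeds):

import pickle
from transformers import BertTokenizer

bert_path = "/Users/kiwi/Desktop/chinese_wwm_ext"
tokenizer = BertTokenizer.from_pretrained(bert_path)

encoded = tokenizer.batch_encode_plus(
    ['中共中央总书记、国家主席、中央军委主席习近平10日专门赴湖北省武汉市考察疫情防控工作'],
    add_special_tokens=True,
    return_tensors='pt',
    pad_to_max_length=True)

# On 2.9.1 batch_encode_plus returns a BatchEncoding whose __getattr__ raises
# KeyError (instead of AttributeError) when pickle looks up __getstate__,
# which should reproduce the KeyError: '__getstate__' from the traceback above.
pickle.dumps(encoded)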

Expected behavior

encoded = {'input_ids': tensor([[ 101,  704, 1066,  704, 1925, 2600,  741, 6381,  510, 1744, 2157,  712,
         2375,  510,  704, 1925, 1092, 1999,  712, 2375,  739, 6818, 2398, 8108,
         3189,  683, 7305, 6626, 3959, 1266, 4689, 3636, 3727, 2356, 5440, 2175,
         4554, 2658, 7344, 2971, 2339,  868,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Environment info

LysandreJik commented 4 years ago

Hi! Indeed, I can reproduce this. It seems to be an edge case we were not testing for in the later v2.x versions. It does run correctly on v2.3.0, as you've shown, and on recent versions (v3+) as well.

After looking a bit deeper into it, the problem seems to have been introduced together with BatchEncoding in version v2.9.0 and was patched in v3.0.0. The versions to avoid when using a parallelization mechanism (here the DataLoader) with batch_encode_plus are therefore those between v2.9.0 and v3.0.0, i.e. v2.9.x, v2.10.x and v2.11.x.
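
If you can't upgrade right away, one possible workaround on those versions is to convert the BatchEncoding back into a plain dict before it leaves the collate function, so the workers only have to pickle regular tensors. Something along these lines should work:

def tok_collate(batch_data):
    batch_sentence = [x[0] for x in batch_data]
    encoded = tokenizer.batch_encode_plus(
        batch_sentence,
        add_special_tokens=True,
        return_tensors='pt',
        pad_to_max_length=True)
    # Copy the key -> tensor mapping into a plain dict; the BatchEncoding
    # wrapper is what trips up pickling on these versions, while the tensors
    # themselves pickle fine.
    return dict(encoded)

Setting num_workers=0 should also sidestep the problem, since the batch then never crosses a process boundary.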

Hope this helps, and sorry for the inconvenience.