huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Seq2SeqTrainer not working for a list of inputs: TypeError: can't convert np.ndarray of type numpy.object_ #11602

Closed helloworld123-lab closed 3 years ago

helloworld123-lab commented 3 years ago

Information

Model I am using: Bert2Bert, trained with Seq2SeqTrainer on Google Colab.

Details

I am trying to use a Bert2Bert model on a multiple-choice QA dataset using Seq2SeqTrainer. My whole code is given below. I convert each question-choices-label example as follows:

input: (Question, option1), (Question, option2), ... target: label

I generate the input tokens with the tokenizer.batch_encode_plus() method as follows:

max_length = 128
def convert_to_commonsense_qa_features(example_batch):
    num_examples = len(example_batch["question"])
    num_choices = len(example_batch["choices"][0]["text"])
    features = {}
    for example_i in range(num_examples):
        choices_inputs = tokenizer.batch_encode_plus(
            list(zip(
                [example_batch["question"][example_i]] * num_choices,
                example_batch["choices"][example_i]["text"],
            )),
            max_length=max_length, pad_to_max_length=True,
        )
        for k, v in choices_inputs.items():
            if k not in features:
                features[k] = []
            features[k].append(v)
    labels2id = {char: i for i, char in enumerate("ABCDE")}
    # Dummy answers for test
    if example_batch["answerKey"][0]:
        features["labels"] = [labels2id[ans] for ans in example_batch["answerKey"]]
    else:
        features["labels"] = [0] * num_examples
    return features

convert_func_dict = {
    "commonsense_qa": convert_to_commonsense_qa_features,
}
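
To make the pairing step concrete, here is a minimal sketch, with no tokenizer involved, of how one question is paired with each of its options before the pairs are passed to batch_encode_plus. The helper name `build_pairs` and the sample texts are illustrative assumptions and do not come from the dataset.

```python
def build_pairs(question, options):
    """Pair one question with each answer option (hypothetical helper)."""
    return list(zip([question] * len(options), options))


# Made-up sample, for illustration only.
question = "Where would you keep milk?"
options = ["fridge", "oven", "garage"]

pairs = build_pairs(question, options)
for pair in pairs:
    print(pair)
```

Because every example produces one encoded sequence per option, each entry appended to `features["input_ids"]` is itself a list of `num_choices` sequences, so the feature is nested as examples × choices × max_length.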

For the input, when I print input_ids, I get the following (an ndarray whose elements are themselves ndarrays):

array([array([ 101, 1996, 2237, 4580, 2001, 1037, 2524, 5271, 2005, 1996, 2613,
       4263, 1010, 2009, 2001, 2157, 2279, 2000, 1037, 2152, 4125, 2054,
       1029,  102, 9282, 2458,  102,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0]),
       array([ 101, 1996, 2237, 4580, 2001, 1037, 2524, 5271, 2005, 1996, 2613,
       4263, 1010, 2009, 2001, 2157, 2279, 2000, 1037, 2152, 4125, 2054,
       1029,  102, 4545, 2311,  102,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0]),
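
An outer array whose elements are themselves arrays is stored by NumPy with dtype object, which is the dtype named in the error. The following is a minimal sketch of how to check this and how stacking the inner arrays gives a regular integer array. The token ids are shortened, made-up values and the variable names are assumptions.

```python
import numpy as np

# Two made-up, equal-length rows standing in for the padded sequences.
rows = [np.array([101, 1996, 102, 0, 0]), np.array([101, 2237, 102, 0, 0])]

# Build an outer array that holds the inner arrays as Python objects.
nested = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    nested[i] = row

print(nested.dtype)  # object
print(nested.shape)  # (2,)

# Stacking the inner arrays gives one contiguous 2-D integer array.
dense = np.stack(list(nested))
print(dense.dtype.kind, dense.shape)  # i (2, 5)
```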

And I train the model as:

training_args = Seq2SeqTrainingArguments(
    output_dir='./models/tpu',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    do_train=True,
    do_eval=True,
    logging_steps=100,  # set to 1000 for full training
    warmup_steps=2000,  # set to 2000 for full training
    overwrite_output_dir=True,
    num_train_epochs=10,
    save_steps=12180,
    fp16=True,
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=bert2bert,
    tokenizer=tokenizer,
    args=training_args,
    #compute_metrics=compute_metrics,
    train_dataset=features_dict["commonsense_qa"]["train"],
    eval_dataset=features_dict["commonsense_qa"]["validation"],
)
trainer.train()

But it produces:

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

I could not fix it. Any ideas?


patil-suraj commented 3 years ago

Could you please post the full error? From what I can see here, the dataset should return tensors, not NumPy arrays. This could be the issue.
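
One possible way to act on this suggestion is to turn each nested feature into a dense integer array before it reaches the trainer. The sketch below assumes every inner sequence is padded to the same length. `to_plain` and `to_dense_batch` are hypothetical helpers, not Transformers APIs, and the sample values are made up.

```python
import numpy as np


def to_plain(value):
    """Recursively unpack lists, tuples and (object) ndarrays into plain Python ints."""
    if isinstance(value, (np.ndarray, list, tuple)):
        return [to_plain(v) for v in value]
    return int(value)


def to_dense_batch(features):
    """Convert every feature into a dense int64 array (hypothetical helper).

    Raises ValueError if the inner sequences do not all have the same length.
    """
    return {
        name: np.array(to_plain(values), dtype=np.int64)
        for name, values in features.items()
    }


def as_object_array(rows):
    """Build an object-dtype array of arrays, mimicking the printed input_ids."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = np.array(row)
    return out


# Made-up batch: 2 examples, 2 choices each, sequences of length 3.
features = {
    "input_ids": [
        as_object_array([[101, 7, 102], [101, 8, 102]]),
        as_object_array([[101, 9, 102], [101, 5, 102]]),
    ],
    "labels": [0, 1],
}

batch = to_dense_batch(features)
print(batch["input_ids"].dtype, batch["input_ids"].shape)
print(batch["labels"].dtype, batch["labels"].shape)
```

Arrays with dtype int64 are among the types listed in the error message, so they can be handed to `torch.from_numpy`. This only addresses the dtype problem. Whether the model accepts an input of shape (examples, choices, length) is a separate question.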