huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

NaN values appear when adding a new padding token to my tokenizer #18603

Closed. tessanix closed this issue 2 years ago

tessanix commented 2 years ago

I'm trying to fine-tune a DialoGPT model on a new dataset. I have already processed my data, and adding a new padding token to the tokenizer didn't seem to cause any issue:

#my dataset : 
print(dataset)
print(dataset[0]['text'])

output

Dataset({ features: ['text'], num_rows: 48423 })

[speaker 1]: Great that you wish to hear the voices of the guitarists. Here are your booking details of the tickets. You wish to purchase 4 tickets for the event The Original Wailers that is going to take place on March 8th in Berkeley, right? [speaker 2]: Yup, you're right. Please May I know where is the event conducted and I need the complete address? [speaker 1]: Please note down the complete address of the event happening. It's at Cornerstone Craft Beer & Live Music, 2367 Shattuck Avenue. Your reservation is successful and have a great time there! [speaker 2]: Thanks much for the information you've given. Please can you help me to find some intermediate priced restaurant that provides Ethiopian kind of food. [speaker 1]: Yup! There is an Ethiopian Restaurant named Addis Restaurant providing excellent and authentic traditional Ethiopian cuisine located in Berkeley. Do you wish to reserve a table here? [speaker 2]:

#tokenizing and adding labels
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', add_special_tokens=True, max_length=246)  # truncation=True, max_length=13

tokenized_datasets = ds.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

tokenized_datasets = tokenized_datasets.add_column("labels", tokenized_datasets[:]['input_ids']) 

train_set = model.prepare_tf_dataset(
    tokenized_datasets,
    shuffle=True,
    batch_size=1,
)
sample = train_set.as_numpy_iterator()
sample = sample.next()

print(tokenized_datasets)
print(train_set)
print(sample)

output

Dataset({ features: ['input_ids', 'attention_mask', 'labels'], num_rows: 48423 })

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None)}, TensorSpec(shape=(1, 246), dtype=tf.int64, name=None))>

({'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'input_ids': array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]])}, array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 
50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]]))
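
To double check what the model will actually see, I can decode the first tokenized example back to text (just a quick sketch, using the same tokenizer instance as above):

# Sanity check: decode the first tokenized example and look at the pad token id.
# The long runs of id 50257 above should correspond to the new '[PAD]' token.
print(tokenizer.decode(tokenized_datasets[0]["input_ids"]))
print("pad token id:", tokenizer.pad_token_id)        # 50257 here
print("original vocab size:", tokenizer.vocab_size)   # 50257 for DialoGPT/GPT-2, so the new id is one past it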

The outputs so far look pretty clean to me. But when I try to make a prediction with my model, or train it, I get NaN values as output:

# Instantiation of the model
from transformers import TFAutoModelForCausalLM, AdamWeightDecay
model = TFAutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

optimizer = AdamWeightDecay(learning_rate=1e-9, weight_decay_rate=0.01)
model.compile(optimizer=optimizer, jit_compile=True)
# model forward pass (returns the loss when labels are passed)
outputs = model(sample[0], labels=sample[1])
print(outputs)

output

TFCausalLMOutputWithCrossAttentions([('loss', ...), ('logits', ...), ('past_key_values', (...
#model training
model.fit(train_set, epochs=1)

output

56/48423 [..............................] - ETA: 2:27:49 - loss: nan

This NaN value is certainly caused by the newly added '[PAD]' token, but I don't know how to deal with it. Can someone help me, please?
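
Could it be that the model's embedding matrix simply has no row for the new token id? Here is a sketch of what I think might be needed (untested on my side, so I may be completely wrong):

# The DialoGPT checkpoint ships with 50257 embedding rows (ids 0-50256),
# but the new '[PAD]' token got id 50257, which would be out of range.
# Resizing adds a randomly initialised row for it:
model.resize_token_embeddings(len(tokenizer))

# Padding positions could also be excluded from the loss by replacing their
# label with -100 (the index the loss computation ignores):
import numpy as np
labels = np.where(sample[0]["attention_mask"] == 0, -100, sample[1])
outputs = model(sample[0], labels=labels)
print(outputs.loss)

If the out-of-range id really is the cause, I would expect the loss to become finite after the resize, but I haven't verified this.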

LysandreJik commented 2 years ago

@ydshieh, would you like to take a look at this issue?

ydshieh commented 2 years ago

Hi @tessanix, thank you for reporting. Could you provide a self-contained code snippet that can be run to reproduce the issue? So far, dataset is not defined, and neither is ds. Also, model is used (model.prepare_tf_dataset) before it is created.
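
Something along these lines would be perfect, i.e. a short script that runs top to bottom with a couple of toy strings in place of your dataset (just a rough sketch of the shape, not necessarily reproducing your exact setup):

from datasets import Dataset
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# toy data standing in for the real dataset
ds = Dataset.from_dict({"text": ["[speaker 1]: hi [speaker 2]: hello", "[speaker 1]: bye"]})

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=32)

tokenized = ds.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized = tokenized.add_column("labels", tokenized[:]["input_ids"])

model = TFAutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
train_set = model.prepare_tf_dataset(tokenized, shuffle=False, batch_size=1)
model.compile(optimizer="adam")  # the model falls back to its internal loss when none is passed
model.fit(train_set, epochs=1)   # shows whether the loss is NaN from the first steps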

It would be really helpful to have a self-contained code snippet for debugging 🙏 . Thank you.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.