philschmid / deep-learning-pytorch-huggingface

MIT License
618 stars 143 forks

Problem with preprocess_function() in tutorial #2

Closed ybagoury closed 1 year ago

ybagoury commented 1 year ago

Hello,

I was following your tutorial on fine-tuning a FLAN-T5 model. However, I've encountered an error whose origin I don't understand. It occurs at this line:

labels = tokenizer(text_target=sample["headline"], max_length=max_target_length, padding=padding, truncation=True)

I've obviously changed the field name to match my dataset, and I get this error:

TypeError: __call__() missing 1 required positional argument: 'text'

I'm using this dataset: JulesBelveze/tldr_news on Hugging Face. The keys are: ['headline', 'content', 'category']

Here's all the relevant code:

from transformers import AutoTokenizer

model_checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# the tutorial derives these from the dataset; fixed values used here
max_source_length = 512
max_target_length = 128

def preprocess_function(sample, padding="max_length"):

    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["content"]]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["headline"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = raw_datasets.map(preprocess_function, batched=True, remove_columns=["headline", "content", "category"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
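For anyone puzzled by the `-100` replacement in the snippet above, here is a minimal, self-contained sketch of just that masking step, using plain lists and T5's actual pad token id (0). PyTorch's `CrossEntropyLoss` ignores targets equal to `-100` by default, so padded label positions don't contribute to the loss; the token ids below are made up for illustration.

```python
pad_token_id = 0  # T5's pad token id

# Two padded label sequences (ids are illustrative only)
label_input_ids = [
    [1363, 19, 3, 0, 0],
    [7142, 0, 0, 0, 0],
]

# Replace every pad token with -100 so the loss function skips it
masked = [
    [(l if l != pad_token_id else -100) for l in label]
    for label in label_input_ids
]

print(masked)
# [[1363, 19, 3, -100, -100], [7142, -100, -100, -100, -100]]
```

Note this is only needed when padding is applied during preprocessing; with dynamic padding, `DataCollatorForSeq2Seq` performs the same replacement at batch time.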
philschmid commented 1 year ago

Can you share what transformers version you have installed?

ybagoury commented 1 year ago

It is version 4.20.1.

philschmid commented 1 year ago

Can you please use the latest version?

ybagoury commented 1 year ago

Thank you, problem fixed.
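For future readers: the `text_target` keyword was added to the tokenizer `__call__` in transformers 4.22.0, which is why it is unknown in 4.20.1 and the call falls through to the required positional `text` argument. Below is a hedged sketch (the helper names are made up, not part of any library) of how code could support both old and new releases; earlier versions used the since-deprecated `as_target_tokenizer()` context manager.

```python
def supports_text_target(transformers_version: str) -> bool:
    """Return True if tokenizer(text_target=...) is available,
    i.e. transformers >= 4.22.0 (stdlib-only version comparison)."""
    major_minor = tuple(int(p) for p in transformers_version.split(".")[:2])
    return major_minor >= (4, 22)

def tokenize_targets(tokenizer, targets, transformers_version, **kwargs):
    """Tokenize label text on either side of the 4.22.0 API change."""
    if supports_text_target(transformers_version):
        # Modern API (>= 4.22.0)
        return tokenizer(text_target=targets, **kwargs)
    # Fallback for older releases such as 4.20.1
    with tokenizer.as_target_tokenizer():
        return tokenizer(targets, **kwargs)
```

Simply upgrading (`pip install -U transformers`), as suggested above, is the cleaner fix.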