huggingface / notebooks

Notebooks using the Hugging Face libraries 🤗
Apache License 2.0

【Need Help!】 About handling of the "labels" in the Huggingface Tutorial #89

Open beyondguo opened 3 years ago

beyondguo commented 3 years ago

Hi @sgugger, I'm a beginner with Hugging Face, and I really love your tutorial, which is the best AI course I've ever seen.

However, I got a little confused by the "A full training" section of the "Fine-tuning a pretrained model" chapter (https://huggingface.co/course/chapter3/4?fw=pt), which says:

# Rename the column label to labels (because the model expects the argument to be named labels).
...
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
...
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
...

I don't think we need to rename "label" to "labels" manually, since the source code of data_collator.py contains:

class DataCollatorWithPadding:

    ...

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        ...
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch

Here the column "label" is already renamed to "labels".
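
To make the quoted logic concrete, here is a minimal pure-Python sketch of just the renaming step. The helper name `rename_label_keys` is made up for illustration and is not part of transformers; the real collator also pads the batch with the tokenizer before reaching this point.

```python
def rename_label_keys(batch):
    """Mimic the renaming step quoted above (hypothetical helper, not library code)."""
    batch = dict(batch)  # shallow copy so the caller's dict is left untouched
    if "label" in batch:
        batch["labels"] = batch.pop("label")
    if "label_ids" in batch:
        batch["labels"] = batch.pop("label_ids")
    return batch


# Toy batch with made-up token ids, standing in for a padded batch.
example = {"input_ids": [[101, 102]], "label": [1]}
renamed = rename_label_keys(example)
print(sorted(renamed))
```

After the call, the batch carries a "labels" key and no "label" key, which matches what the DataLoader output further down shows.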

I have tested the version WITHOUT the line below:

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

and found that "label" was automatically changed to "labels":

tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])
# tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch')
print(tokenized_datasets['train'].column_names)

output: ['attention_mask', 'input_ids','label', 'token_type_ids']

from torch.utils.data import DataLoader, Dataset
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

output: {'attention_mask': torch.Size([8, 65]), 'input_ids': torch.Size([8, 65]), 'token_type_ids': torch.Size([8, 65]), 'labels': torch.Size([8])}

That is, "label" has been automatically changed to "labels" by the data_collator.

sgugger commented 3 years ago

Hi there! We have forums you can use for questions like this, as we like to keep the issues for bugs and feature requests only. In this instance you are right that the data collator renames the labels automatically, so it's not strictly necessary to do it beforehand.

However, we wanted to show the renaming step for when you do everything by hand. Does that make sense?
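
A rough sketch of why the name matters in a hand-written loop that does not go through the data collator: the batch is typically unpacked as keyword arguments, so its keys must match the parameter names of the function being called. `toy_forward` below is a stand-in written for this illustration, not a transformers model, and it assumes a signature with a `labels` parameter and no `**kwargs`.

```python
def toy_forward(input_ids, attention_mask=None, labels=None):
    """Stand-in for a model's forward call (assumed signature, not library code)."""
    return {"has_loss": labels is not None}


# Toy batch with made-up values that still uses the dataset's "label" column name.
batch = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]], "label": [1]}

try:
    toy_forward(**batch)  # "label" is not a parameter of toy_forward
    error = None
except TypeError as exc:
    error = exc

# Renaming the key by hand makes the keyword unpacking work.
batch["labels"] = batch.pop("label")
result = toy_forward(**batch)
print(type(error).__name__, result)
```

Whether a real model raises or silently ignores an unknown key depends on its actual forward signature, so treat this only as an illustration of the keyword-matching idea.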

beyondguo commented 3 years ago

Oh, thank you so much! :)