huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Missing code for predicting custom labels in Bert #12163

Closed gwc4github closed 3 years ago

gwc4github commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Create a dataset and load it.
  2. Set your features with new labels
  3. Load the bert-base-cased config, tokenizer, and model
  4. Tokenize the data
  5. Create a trainer and start it
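from datasets import load_dataset
from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer, Trainer

dataset = load_dataset('json', data_files=datasetPath + pathDel + datasetName, split='train')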

# Dataset column that serves as model's input
text_column_name = "tokens"
# Dataset column that serves as fine-tuning labels (ner_tags, pos_tags, or chunk_tags in our case)
label_column_name = "ner_tags"

# Define variables used by tokenize_and_align_labels fn
column_names = dataset.column_names  # NOT USED (GWC)
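# `features` is the datasets Features object set up in step 2 (its definition is not shown in this snippet)
label_list = features[label_column_name].feature.names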
label_to_id = {label_list[i]: i for i in range(len(label_list))}

# Need to tell the model how many labels it's supposed to predict
num_labels = len(label_list)

model_name = 'bert-base-cased'
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, padding=True, truncation=True)  # GWC CHANGED added padding=True and truncation=True
model = AutoModelForTokenClassification.from_pretrained(model_name, config=config)

padding = True
def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(
            examples[text_column_name],
            padding=padding,
            truncation=True,
            # We use this argument because the texts in our dataset are lists of words (with a label for each word).
            is_split_into_words=True,
        )
        labels = []
        for i, label in enumerate(examples[label_column_name]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                # Special tokens have a word id that is None. We set the label to -100 so they are automatically
                # ignored in the loss function.
                if word_idx is None:
                    label_ids.append(-100)
                # We set the label for the first token of each word.
                elif word_idx != previous_word_idx:
                    label_ids.append(label_to_id[label[word_idx]])
                # For the other tokens in a word, we also use the current word's label (this snippet has no
                # label_all_tokens flag, so sub-word tokens are never set to -100).
                else:
                    label_ids.append(label_to_id[label[word_idx]])
                previous_word_idx = word_idx

            labels.append(label_ids)
        tokenized_inputs["labels"] = labels
        return tokenized_inputs

train_dataset = dataset.map(
            tokenize_and_align_labels,
            batched=True,
        )
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)
print('Training dataset')
trainer.train()

Expected behavior

I expected the model to train on our custom data. Training failed; I found the bug and fixed it, so I am mostly just reporting it here.
The bug is in transformers/tokenization_utils_base.py at line 2990. The _pad() method is missing an if statement for labels. More specifically, it has an if self.padding_side == "right": branch and an if self.padding_side == "left": branch, and both of them are missing the nested if for labels. (They have one for token_type_ids and special_tokens_mask.)

You should add the section for both left and right but here is the change I made for the "right" side:

        if needs_to_be_padded:
            difference = max_length - len(required_input)
            if self.padding_side == "right":
                if return_attention_mask:
                    encoded_inputs["attention_mask"] = [1] * len(required_input) + [0] * difference
                if "token_type_ids" in encoded_inputs:
                    encoded_inputs["token_type_ids"] = (
                        encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference
                    )
                if "labels" in encoded_inputs:
                    encoded_inputs["labels"] = (
                        encoded_inputs["labels"] + [-100] * difference
                    )
                if "special_tokens_mask" in encoded_inputs:
                    encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
                encoded_inputs[self.model_input_names[0]] = required_input + [self.pad_token_id] * difference
.....
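
For illustration, the change suggested above for both padding sides can be sketched outside the library as a standalone function. This is only a sketch under stated assumptions, not the library's `_pad` method: `pad_with_labels` and all of its parameters are hypothetical names, `pad_token_id=0` is an assumed default, and `-100` is used for labels because it is the value the snippet above relies on being ignored by the loss.

```python
def pad_with_labels(encoded, max_length, pad_token_id=0,
                    padding_side="right", label_pad_id=-100):
    """Pad one example's input_ids, attention_mask and labels to max_length.

    Hypothetical helper for illustration; not part of transformers.
    """
    input_ids = list(encoded["input_ids"])
    difference = max(max_length - len(input_ids), 0)
    attention_mask = [1] * len(input_ids)
    ids_pad = [pad_token_id] * difference
    mask_pad = [0] * difference
    label_pad = [label_pad_id] * difference

    padded = dict(encoded)
    if padding_side == "right":
        padded["input_ids"] = input_ids + ids_pad
        padded["attention_mask"] = attention_mask + mask_pad
        if "labels" in encoded:
            padded["labels"] = list(encoded["labels"]) + label_pad
    elif padding_side == "left":
        padded["input_ids"] = ids_pad + input_ids
        padded["attention_mask"] = mask_pad + attention_mask
        if "labels" in encoded:
            padded["labels"] = label_pad + list(encoded["labels"])
    else:
        raise ValueError(f"Invalid padding side: {padding_side!r}")
    return padded


example = {"input_ids": [101, 7, 102], "labels": [-100, 3, -100]}
right = pad_with_labels(example, max_length=5)
left = pad_with_labels(example, max_length=5, padding_side="left")
```

The labels are padded on the same side as the input ids so that each label stays aligned with its token.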
NielsRogge commented 3 years ago

Hi,

Tokenizers in HuggingFace Transformers don't take care of padding labels (this should be done by the user). You can only provide text to a tokenizer, and it will turn it into input_ids, attention_mask and token_type_ids. The tokenize_and_align_labels function will take care of labeling each token.
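
A minimal sketch of padding labels on the user side, assuming each example is a dict with `input_ids` and `labels` lists of equal length. `collate_token_classification` is a hypothetical helper written for this illustration, and `pad_token_id=0` is an assumed value. The library also provides `DataCollatorForTokenClassification`, which can be passed to `Trainer` through its `data_collator` argument and is likely the more practical choice.

```python
def collate_token_classification(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in a batch to the longest example.

    Hypothetical helper for illustration; returns plain Python lists.
    """
    max_length = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        difference = max_length - n
        batch["input_ids"].append(list(f["input_ids"]) + [pad_token_id] * difference)
        batch["attention_mask"].append([1] * n + [0] * difference)
        # Labels get -100 in padded positions so the loss skips them.
        batch["labels"].append(list(f["labels"]) + [label_pad_id] * difference)
    return batch


batch = collate_token_classification([
    {"input_ids": [101, 5, 6, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 0, -100]},
])
```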

gwc4github commented 3 years ago

Thanks for this note @NielsRogge and sorry for the delay getting back to you. Lots to do here. We will make the change in our code but it seems like this would be a good feature for the framework and the code is done.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

gregg-ADP commented 2 years ago

@NielsRogge can we at least get a better error message for this?
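
Until the library reports this more clearly, one option is a pre-flight check in user code. The sketch below is an assumption about what such a check could look like; `check_labels_match_inputs` is a hypothetical name, not a library function, and it assumes each example is a dict with `input_ids` and `labels` lists.

```python
def check_labels_match_inputs(features):
    """Raise a descriptive error if any example's labels and input ids differ in length."""
    for index, feature in enumerate(features):
        n_inputs = len(feature["input_ids"])
        n_labels = len(feature["labels"])
        if n_inputs != n_labels:
            raise ValueError(
                f"Example {index}: {n_labels} labels but {n_inputs} input ids. "
                "Tokenizer padding does not pad `labels`; pad them to the same "
                "length (for example with -100) before batching."
            )


# Passes silently: lengths agree.
check_labels_match_inputs([{"input_ids": [101, 7, 102], "labels": [-100, 3, -100]}])

# Raises ValueError: the second example was padded without its labels.
try:
    check_labels_match_inputs([
        {"input_ids": [101, 7, 102], "labels": [-100, 3, -100]},
        {"input_ids": [101, 7, 102, 0], "labels": [-100, 3, -100]},
    ])
    error_message = None
except ValueError as err:
    error_message = str(err)
```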

mhdi707 commented 1 year ago

Hello, how can I find acceptable labels for train_data to fine-tune a pretrained transformer sentiment model?