Sure, one way you could go about it would be to create a new class similar to BertForSequenceClassification and implement your own custom final classifier. The lib is pretty modular so you can usually subclass/extend what you need.
You can also replace self.classifier with your own model:
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
model.classifier = new_classifier
where new_classifier is any pytorch model that you want.
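For illustration, here is a minimal sketch of what new_classifier could look like (768 is the bert-base hidden size; num_labels is a hypothetical value, not something from this thread; any torch.nn.Module with matching input/output sizes works):
import torch.nn as nn
from transformers import BertForSequenceClassification  # or pytorch_pretrained_bert in older versions

num_labels = 3  # hypothetical number of classes

# maps the 768-dim pooled output to num_labels logits
new_classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, num_labels),
)

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
model.classifier = new_classifier
model.num_labels = num_labels  # keep the built-in loss computation consistent with the new head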
ok... Thanks a lot. I will try it.
@dhpollack Maybe it's a little unrelated to this issue, but I'll still state the situation. I am using the BERT model to classify sentences on two different datasets. It is working fine on the first dataset but not on the second. Is it possible that BERT has saved its weights according to the first dataset and is loading them for the second one as well, and is thus not performing well? For example, the model configuration looks like this for BOTH datasets; I am not sure whether it should have the same vocabulary size for both.
INFO:pytorch_pretrained_bert.modeling:Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
It shows the same message for both datasets:
INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /home/pytorch/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz from cache at cache/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
INFO:pytorch_pretrained_bert.modeling:extracting archive file cache/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c to temp dir /tmp/tmpgummmons
How can I effectively use BERT for two different datasets?
@shivin9 this is definitely not related to the classifier layer. Also, it's a little unclear what you want to do. Are you training on one dataset and then doing inference on another? If that's the case, then you would do something like
# training
model = BertForSequenceClassification.from_pretrained("bert-base-cased")
...
model.save_pretrained("/tmp/trained_model_dir")
# inference
model = BertForSequenceClassification.from_pretrained("/tmp/trained_model_dir")
But as I said, it's unclear. If you are training on both datasets and getting good results on one but not the other, then it probably has to do with your preprocessing. Good luck solving your problem.
Hi, I have a related question. I am experimenting with BERT for a classification task. When I use BertForSequenceClassification.from_pretrained, I can get 100% accuracy on a small data set. But if I use a customized classification head as shown below, which is almost identical to BertForSequenceClassification, I get bad accuracy.
Here is my customized classification head:
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers import BertModel  # or pytorch_pretrained_bert, depending on the version in use

class Bertclfhead(nn.Module):
    def __init__(self, config, adapt_args, bertmodel):
        super().__init__()
        self.num_labels = adapt_args.num_classes
        self.config = config
        self.bert = bertmodel
        self.dropout = nn.Dropout(config['hidden_dropout_prob'])
        self.classifier = nn.Linear(config['hidden_size'], adapt_args.num_classes)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]  # pooled [CLS] output from BertModel
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attentions if they are there
        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # (loss), logits, (hidden_states), (attentions)
and I initialize my model like this:
model = Bertclfhead(bertconfig, adapt_args, BertModel.from_pretrained('bert-base-uncased'))
Am I missing something?
@dhpollack I am first training on x and then inferring on x. Then I'm training on y and inferring on y.
I am also trying to put a BiLSTM on top of BERT, but it seems that BERT doesn't output the vectors in the required format, i.e. (#batches, seq_len, input_dim). Do you have any idea how that can be solved? Right now BERT is just outputting a (BATCH_SIZE, 768) sized vector, 768 being the size of the hidden layer.
@shivin9 you should read the docs. You want the output of the hidden layers, but I think an LSTM on top of BERT is overkill. What you are getting now is the output of the pooling layer.
Also you should close this issue since it's clear this is not an issue with the library.
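For reference, a minimal sketch of the pooled output vs. the per-token hidden states mentioned above (written against the current transformers API rather than the old pytorch_pretrained_bert one; the model name is the one already used in this thread, and the LSTM sizes are just illustrative):
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

inputs = tokenizer("An example sentence.", return_tensors="pt")
outputs = bert(**inputs)

sequence_output = outputs[0]  # (batch_size, seq_len, 768): per-token hidden states
pooled_output = outputs[1]    # (batch_size, 768): pooled [CLS] vector, what the classification head sees

# the per-token hidden states are what a BiLSTM head would consume
lstm = torch.nn.LSTM(input_size=768, hidden_size=256, batch_first=True, bidirectional=True)
lstm_out, _ = lstm(sequence_output)  # (batch_size, seq_len, 512)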
Yeah sure. Thanks for the help.
@mehdimashayekhi Did you solve it? I have the same question! Using BertForSequenceClassification directly and using a custom classifier similar to BertForSequenceClassification give totally different results.
Were you able to resolve the issue of getting BERT to output (batch, seq_len, hidden_size) for the BiLSTM?
Re dhpollack's August 12 comment: maybe something got changed between then and now, but I found you also have to set the model's number of labels to get that to work.
model.classifier = torch.nn.Linear(768, 8)
model.num_labels = 8
Hi, I'm using your suggestion to customise the classifier head:
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification
# StableDropout comes from the DeBERTa modeling code; the exact import path may vary by transformers version
from transformers.models.deberta_v2.modeling_deberta_v2 import StableDropout

model = AutoModelForSequenceClassification.from_pretrained(PATH_TO_REPO)

# custom head
class CustomClassifier(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.classifier0 = nn.Linear(config.hidden_size, config.hidden_size)
        self.classifier1 = nn.Linear(config.hidden_size, config.hidden_size)
        self.classifier2 = nn.Linear(config.hidden_size, config.num_labels)
        drop_out = getattr(config, "cls_dropout", None)
        drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out
        self.dropout = StableDropout(drop_out)

    def forward(self, x):
        x = F.relu(self.classifier0(x))
        x = self.dropout(x)
        x = F.relu(self.classifier1(x))
        x = self.classifier2(x)
        return x

model.classifier = CustomClassifier(model.config)
model.push_to_hub(PATH_TO_REPO)
But when I want to load this model using from_pretrained, I get the following warning, which means that those additional layers are not loaded and a new head is added on top of the trained model.
How can I resolve this issue, or do you have any idea how I can achieve this while keeping the other functionalities of huggingface?
model = AutoModelForSequenceClassification.from_pretrained(PATH_TO_REPO)
Some weights of the model checkpoint at {} were not used when initializing DebertaV2ForSequenceClassification: ['classifier.classifier0.bias', 'classifier.classifier0.weight', 'classifier.classifier1.bias', 'classifier.classifier1.weight', 'classifier.classifier2.bias', 'classifier.classifier2.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at {} and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
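One possible workaround (a sketch only, assuming the checkpoint in PATH_TO_REPO was saved with the CustomClassifier head above; the subclass name here is made up) is to make the model class match the checkpoint before calling from_pretrained, so that the saved classifier.classifier0/1/2 weights have parameters to load into:
from transformers import DebertaV2ForSequenceClassification

class DebertaV2WithCustomHead(DebertaV2ForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        # swap in the custom head so parameter names (classifier.classifier0.*, ...) match the checkpoint
        self.classifier = CustomClassifier(config)

# from_pretrained now finds matching parameter names for the custom head and loads them
model = DebertaV2WithCustomHead.from_pretrained(PATH_TO_REPO)
The custom class does have to be importable wherever the model is loaded, so this works outside the plain AutoModelForSequenceClassification.from_pretrained flow.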
Hi,
Thanks for providing an efficient and easy-to-use implementation of BERT and other models.
I am working on a project that requires me to do binary classification of sentences. I am using BertForSequenceClassification for that, but I am not getting good results, i.e. my loss function doesn't converge. I noticed that by default there is only a single Linear classifier on top of the BERT model. Is it possible to change that?
Thanks, Shivin