huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Dropout p is changing after loading #8443

Closed burakisikli closed 3 years ago

burakisikli commented 4 years ago

Environment info

Who can help

@LysandreJik, @sgugger

Information

Model I am using (Bert, XLNet ...): Bert, Roberta

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. I'm trying to change the dropout probability. For a Bert instance I use one of these methods:

    model.classifier.dropout.p = 0.7
    model.classifier.dropout = nn.Dropout(0.7)

  2. After training is completed, the model is saved:

    model.save_pretrained('xxx/bert')

  3. The model is loaded in another session with the snippet below. After loading, model.classifier.dropout.p changes back to 0.1, the value stored in the config file.

    model = BertForSequenceClassification.from_pretrained(
        "xxx/bert",
        num_labels=3,
        output_attentions=False,
        output_hidden_states=False,
    )
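
For reference, a minimal end-to-end sketch of the behaviour (a sketch rather than the exact training script: it uses the top-level `model.dropout` module that sits in front of the classifier, and 'xxx/bert' is a placeholder path):

    import torch.nn as nn
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
    model.dropout = nn.Dropout(0.7)      # change the dropout in front of the classifier
    print(model.dropout.p)               # 0.7 in this session

    model.save_pretrained("xxx/bert")    # writes the weights and the *unchanged* config.json

    reloaded = BertForSequenceClassification.from_pretrained("xxx/bert", num_labels=3)
    print(reloaded.dropout.p)            # back to 0.1, rebuilt from config.hidden_dropout_prob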

Expected behavior

Dropout p reverts to the default value after the model is loaded. Since the model was explicitly modified before saving, it should keep the modified value instead of falling back to the config default.

LysandreJik commented 4 years ago

Hi, this is not the best way to update the dropout value as it will get overridden by the configuration value on load.

The classifier in BertForSequenceClassification is a linear layer, which has no dropout of its own. If you want to change the dropout that is applied before the linear layer, you should update config.hidden_dropout_prob. You can see the source code here.

The code is made to be easy to read and easy to tweak, so feel free to directly modify the source code to fit your needs.
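
As a sketch of that (assuming `from_pretrained` forwards config overrides such as `hidden_dropout_prob` to the configuration, which is then written to `config.json` on the next `save_pretrained`):

    from transformers import BertForSequenceClassification

    # hidden_dropout_prob is a config field, so overriding it here updates the
    # configuration object itself and therefore survives a save/reload round trip.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=3,
        hidden_dropout_prob=0.7,
    )
    print(model.dropout.p)  # 0.7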

burakisikli commented 4 years ago

Hi, I've already tried that, but it changes the value of all the output dropout layers, since every layer reads the same config value, as you can see below. I think it would be better to have a separate dropout config for the last layer, since the official BERT example suggests tuning it (https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb). The same applies to Roberta. I guess I need to modify the source code accordingly.

from transformers import BertConfig, BertForSequenceClassification

# Raising hidden_dropout_prob changes every dropout that reads it (see the printout below).
config = BertConfig.from_pretrained('bert-base-uncased')
config.hidden_dropout_prob = 0.7
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    config=config,
)
model.cuda()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.7, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.7, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.7, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.7, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.7, inplace=False)
          )
        )
        (2): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.7, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.7, inplace=False)
          )
        )
        ....
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.7, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.7, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
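
A possible workaround sketch in the meantime: keep `hidden_dropout_prob` at its default and only swap out the dropout module in front of the classifier after loading (this is not persisted in the checkpoint, so it has to be re-applied in every session):

    import torch.nn as nn
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Only the dropout right before the classification head is changed;
    # the encoder layers keep config.hidden_dropout_prob (0.1 by default).
    model.dropout = nn.Dropout(p=0.7)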

LysandreJik commented 4 years ago

Yes, the model files are kept completely independent of one another for exactly that purpose: each model file should be very easy to modify on its own.

Feel free to modify the model file so that it fits your needs.
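
For example, instead of editing the file in place, one could subclass the model and drive the head dropout from a custom, made-up config attribute (here called `classifier_dropout_prob`); extra attributes set on the config are serialized to `config.json`, so the value should survive a reload. A sketch:

    import torch.nn as nn
    from transformers import BertConfig, BertForSequenceClassification

    class BertForSequenceClassificationCustomDropout(BertForSequenceClassification):
        """Stock model, but the head dropout is decoupled from hidden_dropout_prob."""

        def __init__(self, config):
            super().__init__(config)
            # classifier_dropout_prob is a made-up config attribute; fall back to
            # the usual hidden_dropout_prob when it is not set.
            p = getattr(config, "classifier_dropout_prob", config.hidden_dropout_prob)
            self.dropout = nn.Dropout(p)

    config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
    config.classifier_dropout_prob = 0.7
    model = BertForSequenceClassificationCustomDropout.from_pretrained(
        "bert-base-uncased", config=config
    )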

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.