Closed: burakisikli closed this issue 3 years ago
Hi, this is not the best way to update the dropout value as it will get overridden by the configuration value on load.
The classifier in BertForSequenceClassification is a linear layer that has no dropout. If you want to change the dropout applied before the linear layer, you should update config.hidden_dropout_prob instead. You can see the source code here.
The code is made to be easy to read and easy to tweak, so feel free to directly modify the source code to fit your needs.
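For illustration, here is a minimal sketch of the in-place tweak warned about above (the value 0.5 is arbitrary); as noted, it does not survive a reload:

import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
# Swap out the dropout module that sits right before the classifier head.
# This only changes the in-memory object: from_pretrained() rebuilds the
# model from its config, so the tweak is lost the next time the model loads.
model.dropout = nn.Dropout(p=0.5)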
Hi, I've already tried that, but it changes the value of all the output dropout layers, since every layer uses the same config, as you can see below. I think it would be better to have a separate dropout config for the last layer, since the official BERT example suggests tuning it (https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb). The same applies to RoBERTa. I guess I need to modify the source code accordingly.
from transformers import BertConfig, BertForSequenceClassification

# Load the default config and raise the hidden dropout probability
config = BertConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.7
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    config=config,
)
model.cuda()
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.7, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.7, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.7, inplace=False)
          )
        )
        .... (layers 1-11 are identical) ....
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.7, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
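To confirm this, one can iterate over all the modules and print every dropout probability; a small sketch, assuming model is the instance built above:

import torch.nn as nn

# Print the probability of every Dropout module in the model. With the
# config above, all of them report p=0.7, except the attention dropouts,
# which are controlled by attention_probs_dropout_prob (p=0.1 here).
for name, module in model.named_modules():
    if isinstance(module, nn.Dropout):
        print(name, module.p)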
Yes, the model files are kept completely independent of each other for exactly that purpose: each one should be very easy to modify on its own.
Feel free to modify the model file so that it fits your needs.
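One way to do that without forking the whole file is a small subclass that reads a separate config attribute for the head dropout; this is only a sketch, and classifier_dropout_prob below is a hypothetical field introduced here for illustration (it does not exist in transformers 3.5.0):

import torch.nn as nn
from transformers import BertConfig, BertForSequenceClassification

class BertWithHeadDropout(BertForSequenceClassification):
    # Decouple the classifier-head dropout from hidden_dropout_prob.
    def __init__(self, config):
        super().__init__(config)
        # Fall back to hidden_dropout_prob if the extra field is absent.
        p = getattr(config, "classifier_dropout_prob", config.hidden_dropout_prob)
        self.dropout = nn.Dropout(p)

config = BertConfig.from_pretrained("bert-base-uncased")
config.classifier_dropout_prob = 0.3  # hypothetical extra attribute
model = BertWithHeadDropout.from_pretrained("bert-base-uncased", config=config)

Since extra config attributes are serialized to config.json by save_pretrained, a value set this way would also survive a save/reload cycle.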
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Environment info
transformers version: 3.5.0
Who can help
@LysandreJik, @sgugger
Information
Model I am using (Bert, XLNet ...): Bert, Roberta
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
I'm trying to change the dropout probability. I'm using one of these methods on a Bert instance:
After training is completed, the model is saved.
Expected behavior
The dropout probability p reverts to the default value after the model is loaded, but the model was modified precisely so that it shouldn't behave that way.
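A minimal sketch of the reported behavior, assuming the dropout was changed directly on the module rather than through the config:

import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.dropout = nn.Dropout(p=0.7)  # in-memory change only

model.save_pretrained("finetuned-bert")  # saves weights + the original config
# Reloading rebuilds the model from the saved config, so the tweak is lost.
reloaded = BertForSequenceClassification.from_pretrained("finetuned-bert")
print(reloaded.dropout.p)  # 0.1 again: the default hidden_dropout_prob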