google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

About the model parameters of the bert-base model #145

Closed. Facico closed this issue 2 years ago.

Facico commented 2 years ago

The paper and the GitHub repository provide the bert-large hyperparameters, but they don't provide the fine-tuning hyperparameters for bert-base (such as batch size, learning rate, ...). If anybody can help me with this it would be greatly appreciated.

SyrineKrichene commented 2 years ago

Hi,

The hyperparameters specified in tapas/utils/hparam_utils.py are used for all the models (small/medium/base/large), except where the corresponding paper explicitly defines them differently. The number of training steps is derived from num_train_examples and train_batch_size: num_train_steps = num_train_examples // train_batch_size.
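For illustration, a minimal sketch of that step calculation (the values are the SQA defaults quoted later in this thread, not read from hparam_utils.py):

```python
# Illustrative only: derive the number of training steps from the
# hyperparameters, as described above. The values below are the SQA
# defaults quoted later in this thread.
train_batch_size = 128
num_train_examples = 200_000 * 128  # examples seen during fine-tuning

num_train_steps = num_train_examples // train_batch_size
print(num_train_steps)  # 200000 steps
```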

The usual BASE architecture that we use for our models is: num_layers=12, hidden_size=768, num_heads=12, intermediate_size=3072. Most of the time we use max_position_embeddings=1024, but this number usually changes when we compare our models against other state-of-the-art models: for a fair comparison we use a similar input size. If you load one of our models you can find additional information, such as dropout, in its bert_config.json.
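Put together, the BASE settings above look roughly like the following (a sketch only; the dropout values are standard BERT defaults assumed here, so check the bert_config.json shipped with each checkpoint for the authoritative numbers):

```python
# Illustrative BASE configuration, mirroring the fields described above.
# Dropout values are assumed BERT defaults, not confirmed from a specific
# TAPAS checkpoint's bert_config.json.
base_config = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 1024,      # often reduced for fair comparisons
    "hidden_dropout_prob": 0.1,           # assumed default
    "attention_probs_dropout_prob": 0.1,  # assumed default
}
```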

Can you please tell us the exact model you are looking at, the task to solve, the paper, and whether this is a pre-training or fine-tuning step? I can then re-check the corresponding experiments.

Thanks, Syrine


Facico commented 2 years ago

Hello, Syrine.

I tried inference and fine-tuning on the SQA dataset using tapas-base from the Hugging Face hub, which has the same parameters as the corresponding config.json. I think the batch size of the base model should be smaller than the large-model setting (batch size 128, 20-hour training on 32 TPUs), so I set the hyperparameters as follows:

batch size = 1/2/4/8/16/32
gradient_accumulation_steps = 1
epoch = 50/100
warmup_ratio = 0.2 
lr = 1.25e-5
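For reference, a minimal sketch of this kind of setup with the Hugging Face transformers API (the toy table and single training step are placeholders for the real SQA data pipeline; only the learning rate mirrors the list above):

```python
import pandas as pd
import torch
from transformers import TapasTokenizer, TapasForQuestionAnswering

model = TapasForQuestionAnswering.from_pretrained("google/tapas-base")
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

# One toy SQA-style example; a real run iterates over the full SQA train set.
table = pd.DataFrame({"City": ["Paris", "Rome"], "Country": ["France", "Italy"]})
inputs = tokenizer(
    table=table,
    queries=["Which country is Paris in?"],
    answer_coordinates=[[(0, 1)]],   # (row, column) of the answer cell
    answer_text=[["France"]],
    padding="max_length",
    return_tensors="pt",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1.25e-5)  # lr from the list above

model.train()
outputs = model(**inputs)  # loss is computed because labels are supplied
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```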

However, I only get an accuracy of 0.6 on the training data (the model has almost converged) and 0.5 on the test data.

I also used the model tapas-base-finetuned-sqa from the Hugging Face hub and got an accuracy of 0.73 on the test data.
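For comparison, that released checkpoint can be sanity-checked via the table-question-answering pipeline (a sketch with a toy table; a proper comparison would loop over the SQA test set):

```python
import pandas as pd
from transformers import pipeline

# Released SQA checkpoint from the Hugging Face hub.
tqa = pipeline("table-question-answering", model="google/tapas-base-finetuned-sqa")

table = pd.DataFrame({"City": ["Paris", "Rome"], "Country": ["France", "Italy"]})
print(tqa(table=table, query="Which country is Paris in?"))
# e.g. {'answer': 'France', 'coordinates': [(0, 1)], 'cells': ['France'], ...}
```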

I wonder if I need a larger batch size to train tapas-base.

Thanks.

SyrineKrichene commented 2 years ago

Hi,

I checked our experiments, taking these SQA models as examples:

- BASE noreset: 0.6393, tapas_sqa_masklm_base.zip (https://storage.googleapis.com/tapas_models/2020_08_05/tapas_sqa_masklm_base.zip)
- BASE reset: 0.6689, tapas_sqa_masklm_base_reset.zip (https://storage.googleapis.com/tapas_models/2020_08_05/tapas_sqa_masklm_base_reset.zip)

The hyperparameters are exactly the ones used for the large models, including:

- train_batch_size: 128
- warmup_ratio: 0.01
- num_train_examples: 200000 * 128
- learning_rate: 5e-5 * (128 / 512)

Reducing the train batch size might negatively affect the performance of the model (training becomes more sensitive to the randomly selected examples).

If you want to keep train_batch_size=32 you can try:

- using a lower lr = 1.25e-5 / 4 (you would probably need more iterations)
- training multiple models with different values of tf_random_seed.
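To make that scaling concrete, a quick back-of-the-envelope check of the two suggestions above (illustrative arithmetic only):

```python
# Linearly rescale the reference learning rate with the batch size,
# following the hparams listed above (reference: lr 5e-5 at batch size 512).
reference_lr, reference_batch_size = 5e-5, 512

lr_batch_128 = reference_lr * (128 / reference_batch_size)  # 1.25e-05, the large-model setting
lr_batch_32 = reference_lr * (32 / reference_batch_size)    # 3.125e-06, i.e. 1.25e-5 / 4

# Keeping the number of seen examples fixed means 4x more steps at batch size 32.
num_train_examples = 200_000 * 128
steps_batch_128 = num_train_examples // 128  # 200,000
steps_batch_32 = num_train_examples // 32    # 800,000
```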

Thanks, Syrine


Facico commented 2 years ago


Thanks.