google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

max_predictions_per_seq and TPU training configuration #83

Open Mistobaan opened 3 years ago

Mistobaan commented 3 years ago

Overview

I am attempting to train the small version of Electra on a custom vocabulary.

Looking at the performance profile, I see that max_predictions_per_seq is set by a heuristic formula:

# Default heuristic from the pre-training config (configure_pretraining.py):
self.max_predictions_per_seq = int((self.mask_prob + 0.005) *
                                   self.max_seq_length)

In my case this evaluates to int((0.15 + 0.005) * 128) = 19. The TPU profiler warns that this is not optimal, because the resulting tensor dimension requires padding.
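For concreteness, here is a small standalone sketch (mine, not code from the repo) that reproduces the heuristic for my settings and, assuming the profiler warning is about dimensions that are not multiples of 8, prints the next TPU-friendly value:

def default_max_predictions(mask_prob, max_seq_length):
    # Same heuristic as in configure_pretraining.py.
    return int((mask_prob + 0.005) * max_seq_length)

def round_up(value, multiple=8):
    # Assumption: padding up to a multiple of 8 would avoid the warning; this is
    # general TPU/XLA padding guidance, not something taken from the ELECTRA code.
    return ((value + multiple - 1) // multiple) * multiple

mask_prob, max_seq_length = 0.15, 128
default_value = default_max_predictions(mask_prob, max_seq_length)  # -> 19
padded_value = round_up(default_value)                              # -> 24
print(default_value, padded_value)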

Questions

Is it safe to override max_predictions_per_seq (for example, rounding it up to a multiple of 8 such as 24) to avoid the padding the profiler complains about, or would that affect pre-training quality? Is there a recommended TPU-friendly setting for the small model?

Configuration

disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 128
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.25
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 5
learning_rate 0.0005
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 19
max_seq_length 128
model_hparam_overrides {}
model_name electra_base_breathe_small_vanilla
model_size small
num_eval_steps 100
num_tpu_cores 8
num_train_steps 1000000
num_warmup_steps 10000
save_checkpoints_steps 1000
temperature 1.0
tpu_job_name None
tpu_zone None
train_batch_size 128
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_size 65536
weight_decay_rate 0.01
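In case it matters, this is roughly how I am passing overrides (assuming the --hparams flag of run_pretraining.py accepts a path to a JSON file, as the README describes); the max_predictions_per_seq value of 24 is only my candidate for avoiding padding, not something recommended by the repo:

import json

# Hyperparameter overrides for run_pretraining.py; vocab_size matches my custom
# vocabulary, and max_predictions_per_seq = 24 is my own TPU-friendly guess.
overrides = {
    "model_size": "small",
    "vocab_size": 65536,
    "max_seq_length": 128,
    "max_predictions_per_seq": 24,
}

with open("hparams.json", "w") as f:
    json.dump(overrides, f, indent=2)

# Then, roughly:
#   python3 run_pretraining.py \
#       --data-dir gs://<my-bucket>/electra_data \
#       --model-name electra_base_breathe_small_vanilla \
#       --hparams hparams.json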

TPU Profiler Screenshots

[Two screenshots from the TPU profiler attached; not reproduced here.]