Overview

I am attempting to train the small version of Electra on a custom vocabulary.

Looking at the performances I see that max_predictions_per_seq is set to an heuristic formula:

self.max_predictions_per_seq = int((self.mask_prob + 0.005) *
                                       self.max_seq_length)

And in my case is set to 19. The TPU profiler warns that this is not optimal as it requires padding.

Questions

How was that size heuristic formulated, any insights?
Is that sequence referenced to a single example length?
Any advice on how to change the max_predictions_per_seq variable?
is there a way to set a min_predictions per seq? would make sense?

Configuration

disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 128
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.25
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 5
learning_rate 0.0005
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 19
max_seq_length 128
model_hparam_overrides {}
model_name electra_base_breathe_small_vanilla
model_size small
num_eval_steps 100
num_tpu_cores 8
num_train_steps 1000000
num_warmup_steps 10000
save_checkpoints_steps 1000
temperature 1.0
tpu_job_name None
tpu_zone None
train_batch_size 128
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_size 65536
weight_decay_rate 0.01

google-research / electra

max_predictions_per_seq and TPU training configuration #83

Overview

Questions

Configuration

TPU Profiler Screenshots