Open Mistobaan opened 3 years ago
I am attempting to train the small version of Electra on a custom vocabulary.

Looking at the performance, I see that `max_predictions_per_seq` is set by a heuristic formula:

```python
self.max_predictions_per_seq = int((self.mask_prob + 0.005) * self.max_seq_length)
```

In my case this comes out to 19. The TPU profiler warns that this is not optimal because it forces padding.
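For reference, the heuristic is easy to reproduce, and the value can be rounded up to a multiple of 8, which is a common guideline for TPU-friendly tensor dimensions. The `round_up_to_multiple` helper below is my own sketch, not part of the ELECTRA codebase:

```python
def heuristic_max_predictions(mask_prob: float, max_seq_length: int) -> int:
    """The heuristic from ELECTRA's configure_pretraining.py."""
    return int((mask_prob + 0.005) * max_seq_length)

def round_up_to_multiple(n: int, multiple: int = 8) -> int:
    """Hypothetical helper: pad a dimension up to a TPU-friendly multiple."""
    return ((n + multiple - 1) // multiple) * multiple

n = heuristic_max_predictions(0.15, 128)
print(n)                        # 19 -- matches the value in my run
print(round_up_to_multiple(n))  # 24 -- a multiple of 8, avoiding the padding warning
```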
```
disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 128
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.25
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 5
learning_rate 0.0005
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 19
max_seq_length 128
model_hparam_overrides {}
model_name electra_base_breathe_small_vanilla
model_size small
num_eval_steps 100
num_tpu_cores 8
num_train_steps 1000000
num_warmup_steps 10000
save_checkpoints_steps 1000
temperature 1.0
tpu_job_name None
tpu_zone None
train_batch_size 128
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_size 65536
weight_decay_rate 0.01
```
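If the derived value is suboptimal, my understanding is that it can be set explicitly, since `run_pretraining.py` accepts an `--hparams` JSON string that is merged into the config. This is a sketch under the assumption that the override is applied on top of the heuristic default; the value 24 is an illustrative choice, not a recommendation from the ELECTRA authors:

```shell
# Sketch: override the derived max_predictions_per_seq via --hparams
# (assumes hparams overrides are merged into the config after the heuristic runs)
python3 run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name electra_base_breathe_small_vanilla \
  --hparams '{"model_size": "small", "max_predictions_per_seq": 24}'
```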
TPU Profiler Screenshots