mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

RNN-T hyperparameters #425

Closed mwawrzos closed 3 years ago

github-actions[bot] commented 3 years ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

johntran-nv commented 3 years ago

@petermattson could you please review or delegate?

petermattson commented 3 years ago

+Elias Mizan emizan@google.com could you please take a look?

emizan76 commented 3 years ago

The only question I have is whether there is a document that justifies why some of these hyperparams are constrained, or whether these constraints are "standard" for RNN-T. For example, all the DALI hyperparams are hard coded, even though we do not really need to use DALI. Does this prohibit anyone who uses DALI from changing these hyperparams? Another example is opt_lamb_epsilon, which can only be 1e-9, and opt_weight_decay (1e-3), while other LAMB params are unconstrained.

emizan76 commented 3 years ago

Sounds good, so the following params are constrained:

- All data_ parameters -- I looked at the rules and it makes sense; likewise for eval_samples
- opt_lamb_epsilon: 1e-9
- opt_weight_decay: 1e-3
- opt_gradient_clip_norm: None
- opt_learning_rate_alt_delay_func: True
- opt_learning_rate_alt_warmup_func: True
- opt_lamb_learning_rate_min: 1e-5

I do not see a reason for the above params being constrained; however, I am OK with these hard-coded values. The rest are unconstrained. I will send an e-mail to the group to see if anyone has an objection. If there is no objection after a couple of days, I will approve it so we can move on.
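For concreteness, here is a minimal sketch of how the frozen values above might be expressed and checked, assuming a plain Python dict as the config format; the dict layout and the check_frozen helper are hypothetical, only the hyperparameter names and values come from the list above.

```python
# Minimal sketch, assuming a plain-dict config; the names and values are the
# frozen RNN-T hyperparameters listed above, everything else is hypothetical.
FROZEN_HPARAMS = {
    "opt_lamb_epsilon": 1e-9,
    "opt_weight_decay": 1e-3,
    "opt_gradient_clip_norm": None,            # i.e. no gradient clipping
    "opt_learning_rate_alt_delay_func": True,
    "opt_learning_rate_alt_warmup_func": True,
    "opt_lamb_learning_rate_min": 1e-5,
}

def check_frozen(hparams: dict) -> None:
    """Raise if a submission config changes any frozen hyperparameter."""
    for name, required in FROZEN_HPARAMS.items():
        actual = hparams.get(name, required)
        if actual != required:
            raise ValueError(f"{name} must be {required!r}, got {actual!r}")
```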

emizan76 commented 3 years ago

Pasting George's and Marek's e-mail comments here, and approving since nobody has raised strong objections:

George:

These look mostly fine to me as well. My only minor comment, similar to Elias's, is that I don't understand why lamb.opt_weight_decay is constrained to 1e-3 while opt_base_learning_rate, beta_1, beta_2, etc. are unconstrained. I can understand opt_lamb_epsilon being constrained because it may not be a useful tuning parameter, but opt_weight_decay could plausibly be one.
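As a rough sketch of why the two parameters play different roles, here is the per-layer LAMB update, paraphrased from the paper George links below with bias correction omitted; the mapping of epsilon to opt_lamb_epsilon and lambda to opt_weight_decay is assumed, not stated in the thread.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,\\
u_t &= \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda\, w_t, &
w_{t+1} &= w_t - \eta_t\, \frac{\lVert w_t \rVert}{\lVert u_t \rVert}\, u_t .
\end{aligned}
```

Here epsilon only stabilizes the denominator, so its exact value rarely matters, while lambda scales a regularization term proportional to the weights, which is the kind of knob that can plausibly affect convergence.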

Marek

Beta_1 and beta_2 were helpful in the case of the BERT benchmark, and I expect the same result here. I propose to freeze opt_weight_decay, because it's desirable to constrain as many hyperparameters as possible to reduce the burden on submitters to shmoo, and nobody has shown a good reason why opt_weight_decay must be unconstrained. Can you share some reasoning for unfreezing more parameters? Maybe some data from training, or an example of convergence improvement/stabilization from a different model when the particular parameter is tuned. Unfreezing is fine if it makes the benchmark better.

George

Here is a very recent example of BERT training where weight_decay is considered a tuning parameter: https://arxiv.org/pdf/1904.00962.pdf

I agree that for the RNN-T model, we don't have evidence whether opt_weight_decay tuning is helpful or not. In the absence of data going either way, I feel it's more fair to keep it unconstrained. In the reference submission, there must be a reason why opt_weight_decay is 1e-3 and not the default value of zero.

Anyway, I don’t want to stop approval of this or anything – but probably the whole group should weigh in.