Perhaps we should discuss again next meeting, but my memory from the last discussion was that we wanted to make sure people did not use specific packs to improve their convergence. I specifically recall we discussed not wanting to allow arbitrary packing (i.e., pack all the length-1 sequences together).
Let's indeed discuss this again. IMHO, if we disallow the most efficient packing, we should also disallow bucketing elsewhere, since both have a similar overall effect. However, as a mitigating factor, there is a requirement of running a minimum of 3M steps, so I would imagine that faster convergence from particular packs wouldn't matter much.
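To make the distinction concrete, here is a minimal sketch of packing in shuffled dataset order versus the kind of "arbitrary" length-sorted packing discussed above. The function name, sequence lengths, and the 512-token cap are illustrative assumptions, not anything specified in this PR.

```python
import random

def first_fit_pack(lengths, max_seq_len=512):
    """Greedy first-fit packing, keeping the order the sequences are given in."""
    packs = []
    for n in lengths:
        for pack in packs:
            if sum(pack) + n <= max_seq_len:
                pack.append(n)
                break
        else:
            packs.append([n])
    return packs

# Illustrative sequence lengths for a shuffled dataset.
lengths = [random.randint(1, 512) for _ in range(2_000)]

# Packing in shuffled dataset order: each pack mixes lengths at random.
shuffled_packs = first_fit_pack(lengths)

# "Arbitrary" packing as discussed above: sort by length first, so e.g. all
# the shortest sequences end up packed together, which can change per-batch
# statistics and therefore convergence.
sorted_packs = first_fit_pack(sorted(lengths))

print(shuffled_packs[0])  # a mix of lengths
print(sorted_packs[0])    # mostly very short sequences grouped together
```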
I think "(d) hyperparameter borrowing is still possible, meaning that same set of hyperparameters should converge similarly with or without packing." is well intentioned and the right goal. I am not sure if this is actually true. I think an opinion from a submitter who submits packed language models is needed to validate this doesn't disqualify their previous work.
This PR is no longer valid. We are following up in PR 418. Closing.
Clarifying packing/padding rules for https://github.com/mlcommons/training_policies/issues/376.