mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Clarification of Non-Determinism for Serial Steps in Models #36

Open bitfort opened 6 years ago

bitfort commented 6 years ago

MLPerf generally prohibits introducing non-determinism to achieve speedups, but there is a grey area here. For example, my understanding is that MLPerf allows non-determinism that comes from threading and interleaving (or from using different batch sizes when scaling up models), or that is already inherent in data shuffling.

I want to clarify my understanding that it is explicitly not allowed (at least for v0.5) to introduce lazy/eventual consistency where the reference implementation uses strong consistency. For example, using stale or eventually consistent gradients (or models) is prohibited unless the reference implementation does the same.
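To make the distinction concrete, here is a minimal toy sketch (purely illustrative, not from any MLPerf reference implementation; the function names and the staleness model are invented) contrasting an update rule that always uses the gradient of the current weights with one that uses gradients computed from weights a few steps old:

```python
import numpy as np

def sgd_strongly_consistent(w, grad_fn, lr, steps):
    """Every update uses the gradient of the *current* weights (serial dependency)."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def sgd_stale(w, grad_fn, lr, steps, staleness=2):
    """Each update may use a gradient computed from weights several steps old,
    as in asynchronous / eventually consistent schemes."""
    history = [w]
    for t in range(steps):
        stale_w = history[max(0, t - staleness)]
        w = w - lr * grad_fn(stale_w)
        history.append(w)
    return w

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w0 = np.array([1.0, -2.0])
print(sgd_strongly_consistent(w0, lambda w: 2 * w, lr=0.1, steps=10))
print(sgd_stale(w0, lambda w: 2 * w, lr=0.1, steps=10))
```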

ad6fp commented 6 years ago

What about other sources of non-determinism, e.g. silent data corruption of hardware state or stochastic rounding? Will these be permitted? Silent data corruption due to high-energy particles is challenging to eliminate from all hardware structures. I suppose stochastic rounding is permitted as long as it can be deterministically reproduced? How are the sources of non-determinism expected to be validated?
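As a side note on the reproducibility point, here is a rough sketch of stochastic rounding (rounding to an integer grid for simplicity; real implementations round to a low-precision float format). It is only an illustration of the question being asked, not an endorsed mechanism; it shows that the randomness is reproducible once the RNG seed is fixed:

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each value down or up, with the probability of rounding up
    equal to its fractional part."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

# With a fixed seed the random rounding decisions are exactly reproducible.
rng = np.random.default_rng(seed=0)
print(stochastic_round(np.array([0.3, 2.7, -1.5]), rng))
```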

The example of eventually consistent gradients not being permitted would then imply that mechanisms like deep gradient compression are also not permitted? Isn't it possible for eventually consistent gradients to be deterministic?
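For context on what a mechanism like deep gradient compression does, here is a heavily simplified sketch (top-k sparsification with a local error accumulator; the actual DGC technique adds momentum correction, clipping, and warm-up, none of which is shown, and nothing here is MLPerf-sanctioned):

```python
import numpy as np

def compress_topk(grad, residual, k):
    """Transmit only the k largest-magnitude entries; accumulate the rest
    locally so they are applied in a later step (i.e. 'eventually')."""
    accumulated = grad + residual
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]
    sparse = np.zeros_like(accumulated)
    sparse[idx] = accumulated[idx]
    new_residual = accumulated - sparse
    return sparse, new_residual

# Given identical inputs this is fully deterministic, which is the point of
# the question above: eventual consistency need not imply non-determinism.
g = np.array([0.1, -3.0, 0.02, 2.5, -0.4])
sparse, resid = compress_topk(g, residual=np.zeros_like(g), k=2)
print(sparse, resid)
```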

bitfort commented 6 years ago

SWG Recommendation:

TheKanter commented 6 years ago

It seems like prohibiting stale gradients or eventual consistency will penalize several vendors (particularly those that have limited storage). For example, Graphcore has been fairly clear that they believe recomputing gradients from snapshots is a valid option to reduce the memory footprint.
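One common reading of "recomputing from snapshots" is activation recomputation (checkpointing): store only the inputs at segment boundaries during the forward pass and recompute intermediate activations during backprop, trading extra compute for memory. Whether or not that is exactly the technique meant here, the sketch below (using an invented `layer.forward`/`layer.backward` API) shows the general shape of the trade; note that, unlike stale gradients, this kind of recomputation is mathematically equivalent to the standard pass:

```python
def backward_with_recompute(layers, x, grad_out, checkpoint_every=2):
    """Forward pass stores only inputs at checkpoint boundaries; the backward
    pass recomputes each segment's intermediate activations on demand."""
    snapshots = []
    for i, layer in enumerate(layers):
        if i % checkpoint_every == 0:
            snapshots.append((i, x))   # snapshot this segment's input
        x = layer.forward(x)

    param_grads = []
    for start, x0 in reversed(snapshots):
        segment = layers[start:start + checkpoint_every]
        acts = [x0]                    # recompute activations for this segment
        for layer in segment:
            acts.append(layer.forward(acts[-1]))
        for layer, a in zip(reversed(segment), reversed(acts[:-1])):
            grad_out, g = layer.backward(a, grad_out)  # hypothetical API
            param_grads.append(g)
    return grad_out, list(reversed(param_grads))
```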

bitfort commented 6 years ago

Allow me to clarify one thing: this recommendation refers to the "Closed" division of MLPerf (as opposed to the Open division).

If there are considerations here you think may have been missed, please don't hesitate to reach out. It's not our intention to penalize any vendor; please bring this issue up on the MLPerf mailing list (https://groups.google.com/forum/#!forum/mlperf) to continue the discussion on this topic.

bitfort commented 6 years ago

SWG:

We will reach out for more comments and re-evaluate this to provide more clarity and incorporate more input. Please reach out if you'd like to be a part of this process.

ad6fp commented 6 years ago

Victor -

I would like to be part of the process.

Gary Lauterbach gary@cerebras.net CTO and Co-Founder, Cerebras Systems

TheKanter commented 6 years ago

Victor, the distinction between closed and open is more important than you may realize.

In the initial discussion of MLPerf, the open division was characterized as being good for research, for speculative techniques and experiments that are outside of common practices. The closed division was characterized as being good for 'apples to apples' comparisons for real hardware and systems.

Prohibiting certain techniques (e.g., batch size = 1, eventual consistency) is an implicit statement by MLPerf that those techniques are experimental and not commercially relevant. That's a problem for any company interested in such techniques, because it's MLPerf saying they are merely experimental.

petermattson commented 6 years ago

Hi all,

Philosophically, closed implementations should be mathematically equivalent subject to basic computing limits (e.g. fp ordering). We've allowed some small deviations that come at a cost (e.g. fp normalization is clearly defined and adds work). This is a more challenging issue because:

(1) it essentially allows use of a different optimizer
(2) in some cases, it may be a performance optimization on larger architectures

So it's both a fairly big deviation, and one that may sometimes come with a performance gain rather than a cost.
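A tiny illustration of the "fp ordering" caveat (a generic example, not tied to any benchmark): summing the same float32 values in two different orders typically gives results that differ in the last bits, which is the kind of deviation treated as a basic computing limit rather than a rule violation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_all = float(np.sum(x))                                             # one reduction order
s_chunked = float(sum(np.sum(c) for c in np.array_split(x, 1000)))   # another order

print(s_all == s_chunked, abs(s_all - s_chunked))   # typically False, with a tiny gap
```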

We should strive to admit a range of architectures to closed, but still need to draw the line somewhere and say "this is a different class of thing." Some questions that might help consider this change:

(1) What principles should we use to make such calls that don't admit everything?
(2) In this particular case, what data could we use to make a decision? E.g. for what memory sizes is this effectively required? What work has been done on performance impacts?
(3) What approaches might we use to limit delta and/or impose cost if we decided to admit it?

Best,
Peter

petermattson commented 5 years ago

Reconsider as part of both ends of batch size scaling.

bitfort commented 5 years ago

SWG:

Action Item to Gary (@ad6fp): what would be a conservative small batch size threshold to allow this? Do you have data to show this is necessary below that batch size?

bitfort commented 5 years ago

SWG:

Something along these lines may be required to support small batch sizes. We are working on enabling both large and small batch sizes. Our understanding is that this is not needed for this submission cycle. This can be revisited during a future submission cycle.