openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0
2.58k stars 401 forks source link

Hard error with PGLE Accruracy Checker #16569

Open shraiysh opened 2 weeks ago

shraiysh commented 2 weeks ago

In a recent commit, a hard error was introduced in PGLE when instructions are missing from the profile. The hard error has the following problems:

  1. If there is a conditional operation (if-else) in HLO, and it always jumps to the same branch, then the other computation never gets executed - hence will not have its costs in the profile. If we then use this profile for PGLE, it will crash.
  2. There are some modules that only run during initialization. If the profiler runs after a few training steps, then these modules do not run for the profiler. If we then use this profile for PGLE, it will crash.

Possible solutions:

  1. Hard error behind a flag like --strict-pgle. The default behavior should not have a hard error with missing instructions, instead it should use a default cost for them.
  2. Discuss a testing methodology that should be able to test the effect of PGLE on HLO schedule so we do not have to have this hard error at all.
shraiysh commented 2 weeks ago

@golechwierowicz

abhinavgoel95 commented 2 weeks ago

cc @Tixxx for viz.

golechwierowicz commented 1 week ago

Hey,

As a mitigation would disabling the pass work for you? It is implemented as such: https://github.com/tensorflow/tensorflow/pull/73606/files.

As for the problems: Re.1. We did not think this would be an issue in practice. But for cases like this, as an escape hatch we wanted to rely on pass disabling mechanism. Re.2. Can you not feed the profile to initialization modules at all? We have tested this with JAX_ENABLE_PGLE feature and it worked well.

As for the solutions: In 3P this proves to be an issue so for the long term plan, I think we can go with solution 1 and issue a relevant warning as a default. A testing methodology is a separate topic and should be pursued in parallel nonetheless.

cc: @frgossen @cheshire