Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function

jorgeesg commented 3 months ago

What's the issue, what's expected?: Given the same baseline file and diagnosis rule file, the generated diagnosis_summary report varies between executions, although it is expected to be identical on every run. The inconsistent diagnosis behavior occurs when the "failure_check" function is used in the diagnosis rule file.

How to reproduce it?:

Have a SuperBench results JSONL file, ideally with data from multiple nodes, to make testing easier.
Have a diagnosis rule file that uses the "failure_check" function for some rules. For example, use it for a rule that verifies the return code metrics, much like the example in the documentation: https://microsoft.github.io/superbenchmark/docs/user-tutorial/data-diagnosis/
Have a baseline file to use with data diagnosis.
In your terminal, run SuperBench data diagnosis multiple times. Use the "--output-all" flag to see the status report for all your nodes. The log messages in the terminal will be inconsistent between runs, even though no inputs changed and you are simply re-running the command. Run it at least 10 consecutive times to be sure the inconsistent behavior shows up.
Between runs, check the diagnosis_report. The status of certain nodes will vary between accepted and failed because of the return code metrics.
Sometimes the report is accurate and uses the return code metrics properly, while other times it erroneously marks all nodes as bad because of the return code metrics, even though the return code was correct (0) for some of the nodes.
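
The repeated-run check described in the steps above can be automated. What follows is only a minimal sketch, not part of SuperBench: the helper names (digest_counts, is_idempotent, make_runner) are hypothetical, and the diagnosis command and report path are left for the caller to supply exactly as they would type them in the terminal. It assumes the report is written in a text format, since a binary format may embed timestamps that make a byte-for-byte comparison misleading.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable, Sequence


def digest_counts(produce_report: Callable[[], bytes], runs: int = 10) -> Counter:
    """Call produce_report `runs` times and count each distinct SHA-256 digest."""
    counts: Counter = Counter()
    for _ in range(runs):
        counts[hashlib.sha256(produce_report()).hexdigest()] += 1
    return counts


def is_idempotent(produce_report: Callable[[], bytes], runs: int = 10) -> bool:
    """Return True when every run produced byte-identical output."""
    return len(digest_counts(produce_report, runs)) == 1


def make_runner(command: Sequence[str], report_path: str) -> Callable[[], bytes]:
    """Build a callable that runs `command` once and returns the report bytes.

    Both arguments are assumptions supplied by the caller: `command` is the
    full data diagnosis command line, and `report_path` is wherever that
    command writes its report.
    """
    def run_once() -> bytes:
        subprocess.run(list(command), check=True)
        return Path(report_path).read_bytes()

    return run_once
```

To use it, pass your own command and report location to make_runner and hand the result to digest_counts with runs=10. More than one distinct digest means the report changed between runs with unchanged inputs, which is the behavior reported here.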

Logs and snapshots: When the data diagnosis process uses the return code metrics properly, you will always see two log messages, as in this picture: one from data_diagnosis.py line 265 and the other from line 330.

In contrast, this second image shows what the logs look like when SuperBench does not apply the return code diagnosis rules correctly. It simply marks all nodes as bad, citing all the return code metrics.

Additional information: SB version 0.10. The current workaround is to NOT use the "failure_check" function and to replace it with "value" instead. However, users of "failure_check" may be unaware of this behavior.

jorgeesg commented 2 months ago

Note: upon further testing, this inconsistent behavior can also be triggered even when the diagnosis rule file contains no failure_check rules.

yukirora commented 1 month ago

Hi, thanks for reaching out! Could you please provide more information so that we can reproduce the issue, including the raw data file, rule file, command, and pandas version?

cp5555 commented 1 month ago

Hi @jorgeesg, do you still have this issue? If not, we will close it.

jorgeesg commented 1 month ago

Hello @cp5555 and @yukirora, I can re-test this, gather the relevant information, and post back about this issue early next week. Thanks.

yukirora commented 2 weeks ago

Hi Jorge, we have merged PR #638 to fix the issue, so this issue is going to be closed. Please let us know if you have more questions.

jorgeesg commented 2 weeks ago

Thank you very much for the support and help :)

microsoft / superbenchmark

Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function #626