mlcommons / power-dev

Dev repo for power measurement for the MLPerf™ benchmarks
https://mlcommons.org/en/groups/best-practices-power
Apache License 2.0

Time difference for 4 phases of run more than 500ms #320

Closed arjunsuresh closed 7 months ago

arjunsuresh commented 11 months ago

We have seen the error below many times in previous rounds, and it currently affects about 5-10% of our power runs. In the interest of time we are simply rerunning the failed runs, but based on all our results, increasing the allowed time difference from 500 ms to 1 s should resolve the issue. An example log follows.

[2023-08-12 09:03:56,445 power_checker.py:753 INFO] [x] Check client sources checksum
[2023-08-12 09:03:56,445 power_checker.py:753 INFO] [x] Check server sources checksum
[2023-08-12 09:03:56,445 power_checker.py:753 INFO] [x] Check PTD commands and replies
[2023-08-12 09:03:56,445 power_checker.py:753 INFO] [x] Check UUID
[2023-08-12 09:03:56,445 power_checker.py:753 INFO] [x] Check session name
[2023-08-12 09:03:56,445 power_checker.py:740 ERROR] [ ] Check time difference
[2023-08-12 09:03:56,445 power_checker.py:741 ERROR]    The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 593.8124656677246ms

arjunsuresh commented 11 months ago

[2023-08-13 12:36:04,668 power_checker.py:753 INFO] [x] Check client sources checksum
[2023-08-13 12:36:04,668 power_checker.py:753 INFO] [x] Check server sources checksum
[2023-08-13 12:36:04,669 power_checker.py:753 INFO] [x] Check PTD commands and replies
[2023-08-13 12:36:04,669 power_checker.py:753 INFO] [x] Check UUID
[2023-08-13 12:36:04,669 power_checker.py:753 INFO] [x] Check session name
[2023-08-13 12:36:04,669 power_checker.py:740 ERROR] [ ] Check time difference
[2023-08-13 12:36:04,669 power_checker.py:741 ERROR]    The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 562.474250793457ms

arjunsuresh commented 11 months ago

[2023-08-13 12:36:04,859 power_checker.py:753 INFO] [x] Check client sources checksum
[2023-08-13 12:36:04,859 power_checker.py:753 INFO] [x] Check server sources checksum
[2023-08-13 12:36:04,859 power_checker.py:753 INFO] [x] Check PTD commands and replies
[2023-08-13 12:36:04,859 power_checker.py:753 INFO] [x] Check UUID
[2023-08-13 12:36:04,859 power_checker.py:753 INFO] [x] Check session name
[2023-08-13 12:36:04,859 power_checker.py:740 ERROR] [ ] Check time difference
[2023-08-13 12:36:04,859 power_checker.py:741 ERROR]    The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 533.1976413726807ms
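
To compare the observed differences in the logs above with the current 500 ms limit and the proposed 1 s limit, a short script can pull the values out of the error lines. This is only a minimal sketch and not the actual `power_checker.py` logic: the regular expression and the names `observed_differences_ms` and `passes` are hypothetical, and it assumes the error message keeps the wording shown in these logs.

```python
import re

# Hypothetical pattern; assumes the error text matches the logs in this thread.
OBSERVED_RE = re.compile(r"Observed difference is ([0-9.]+)ms")

# The three error lines quoted in this issue (message part only).
LOG = """\
The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 593.8124656677246ms
The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 562.474250793457ms
The time difference for 4 phase of ranging mode is more than 500ms.Observed difference is 533.1976413726807ms
"""


def observed_differences_ms(log_text):
    """Return every observed time difference (in ms) found in the log text."""
    return [float(value) for value in OBSERVED_RE.findall(log_text)]


def passes(diff_ms, limit_ms):
    """A difference passes when it does not exceed the limit."""
    return diff_ms <= limit_ms


if __name__ == "__main__":
    for diff in observed_differences_ms(LOG):
        print(
            f"{diff:.1f} ms: "
            f"500 ms limit -> {'pass' if passes(diff, 500.0) else 'fail'}, "
            f"1 s limit -> {'pass' if passes(diff, 1000.0) else 'fail'}"
        )
```

All three differences quoted here lie between 500 ms and 1 s, so they fail the current check and would pass with the proposed limit.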

arjunsuresh commented 9 months ago

PR to fix this: https://github.com/mlcommons/inference/pull/1496

arjunsuresh commented 7 months ago

Hopefully addressed here.