Open TheKanter opened 4 years ago
One idea is to start power measurement at the 8- or 16-chip scale -- roughly one "box". All benchmarks appear to run >5 min at this scale. We can look at other options for larger scales in the future.
Backlogged, since there is no power measurement in v0.7.
Need run-time >5 min to get a good estimate of steady-state perf; otherwise power draw can exceed TDP for short periods, so it is not really a fair measurement.
E.g., scaling across many nodes can give short run-times, so the system stays in the “cold DVFS” regime (related to the temperature of the heat sink or cooling solution).
Problem: a measurement taken over an at-scale run is no longer representative (assuming the goal is steady-state measurement).
Can we mitigate this through rules for how to run these systems?
Do we all agree with this assumption?
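As a rough illustration of the concern above, assume a hypothetical two-phase model (not something stated in the discussion): the system draws $P_{\text{cold}}$ for the first $t_c$ seconds while it is still in the cold DVFS regime, then settles to $P_{\text{ss}}$. All symbols here are illustrative assumptions. The average power measured over a run of length $T \ge t_c$ would then be:

```latex
\bar{P}(T) = \frac{t_c \, P_{\text{cold}} + (T - t_c) \, P_{\text{ss}}}{T}
           = P_{\text{ss}} + \frac{t_c}{T} \left( P_{\text{cold}} - P_{\text{ss}} \right)
```

Under this model the bias term shrinks as $1/T$, so a short at-scale run would overstate steady-state power whenever $P_{\text{cold}} > P_{\text{ss}}$, while a run much longer than $t_c$ would approach $P_{\text{ss}}$.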
Mitigation
Testing protocol that requires a warm-up period, which exercises the hw and burns power, followed by a measurement period
Require X warm-up runs to ensure that total run-time exceeds minimum length
Or turn off dynamic clock (DK believes this is a bad bad idea!)
Chip vendors would need to provide guidance about this protocol
Minimum run time will vary based on cooling solution (e.g., liquid vs. air-cooled)
Biggest downside is increased complexity for groups which perform the submission runs (and it’s already complex)
Unclear how “do X runs first” would interact with the caching policy, which requires a cold start
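The “X warm-up runs” option above can be sketched as follows. This is only a minimal illustration under stated assumptions: the warm-up alone must last at least the minimum period, the 300-second default is a placeholder taken from the 5-minute figure discussed here, and `run_once` is a hypothetical callable standing in for whatever launches one benchmark run. It does not address the caching-policy interaction.

```python
import math


def warmup_runs_needed(run_seconds: float, min_warmup_seconds: float = 300.0) -> int:
    """Smallest X such that X back-to-back runs last at least min_warmup_seconds.

    Assumption: every run takes about run_seconds, and the warm-up period
    alone (before the measured run) must meet the minimum.
    """
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return math.ceil(min_warmup_seconds / run_seconds)


def run_with_warmup(run_once, run_seconds: float, min_warmup_seconds: float = 300.0):
    """Execute the warm-up runs, then one measured run.

    run_once is a hypothetical callable taking measure=True/False; it should
    return the power result when measure is True.
    """
    for _ in range(warmup_runs_needed(run_seconds, min_warmup_seconds)):
        run_once(measure=False)  # exercises the hw and burns power, result discarded
    return run_once(measure=True)  # only this run is reported
```

For example, a benchmark that finishes in 45 seconds would need 7 warm-up runs to cover a 5-minute warm-up period. The minimum would have to come from vendor guidance and would differ between air-cooled and liquid-cooled systems.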
DK: Discussion with an expert confirms that a 5-minute warm-up period would work for an air-cooled system. Still need details for liquid-cooled systems.
AIs:
DK to talk to liquid cooling people
Vendors to talk to internal power management experts; please ask about liquid-cooled systems in particular