Closed jaywonchung closed 1 year ago
@Rosie-m is the main reviewer for this PR, but I'd also like @zyang37 to take a light look to get a feel for the team's workflow.
Along with the changes, I implemented a caching mechanism for metrics that stores energy/time numbers inside `self._metric_cache`. The cache dictionary is reset by `zeus_ctx.reset()`, so the heavy pandas computation runs only once even if the user accesses `zeus_ctx.total_energy` multiple times.
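To make the caching behavior concrete, here is a minimal sketch of the pattern, not the code in this PR. The class name `MetricCacheSketch`, the `_get_metric` helper, the `compute_calls` counter, and the `(energy, time)` sample tuples are all hypothetical; a plain `sum` stands in for the pandas computation, and having `reset()` also drop the samples is an assumption.

```python
class MetricCacheSketch:
    """Hypothetical stand-in for the context object; not the PR's actual class."""

    def __init__(self, samples):
        # Assumed layout: one (energy_joules, time_seconds) tuple per iteration.
        self._samples = list(samples)
        self._metric_cache = {}
        self.compute_calls = 0  # Only here to show how often we recompute.

    def _get_metric(self, name, compute):
        # Run the expensive computation only on a cache miss.
        if name not in self._metric_cache:
            self.compute_calls += 1
            self._metric_cache[name] = compute()
        return self._metric_cache[name]

    @property
    def total_energy(self):
        # A plain sum stands in for the heavy pandas computation.
        return self._get_metric(
            "total_energy", lambda: sum(e for e, _ in self._samples)
        )

    @property
    def total_time(self):
        return self._get_metric(
            "total_time", lambda: sum(t for _, t in self._samples)
        )

    def reset(self):
        # Clearing the cache forces the next access to recompute.
        # Dropping the samples too is an assumption of this sketch.
        self._samples.clear()
        self._metric_cache.clear()


ctx = MetricCacheSketch([(10.0, 0.5), (12.0, 0.6)])
first = ctx.total_energy   # Cache miss: computes and stores the value.
second = ctx.total_energy  # Cache hit: returns the stored value.
```

Keying the cache by metric name keeps `reset()` to a single `clear()` call, at the cost of having to remember to invalidate whenever the underlying measurements change.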
This PR implements `zeus.monitor.ZeusMonitorContext`, which is intended to be used by DNN training scripts to profile their per-iteration energy and time consumption.