Basic change: master-only and worker-only hooks. Works well locally. Example of local run:
Things I'm still not happy with:
1) What actionable information do I get from the worker logs? What does it mean if time per step is really high or throughput is low? Example of a cloud run: what does this log say about my model, or what should I do about it? I don't see the story the worker logs are telling. Also, if a worker dies and restarts, I think its counter will start again at 0, so the logged steps are detached from the 'global step'.
2) Printing loss or eval stats is useful. Maybe only the master should print, and the workers should print nothing?
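One possible direction for both points, as a rough sketch rather than what this change does: log only on the master, and key every log line to the global step instead of a per-process counter, so a restarted worker cannot reset the numbering. `StepLogger`, `is_chief`, `batch_size`, and `every_n_steps` are hypothetical names used for illustration; this is plain Python, not a TensorFlow hook API, and it assumes every worker uses the same batch size.

```python
import time


class StepLogger:
    """Hypothetical helper: master-only logging keyed to the global step."""

    def __init__(self, is_chief, batch_size, every_n_steps=100,
                 clock=time.monotonic, emit=print):
        self.is_chief = is_chief          # assumption: True only on the master
        self.batch_size = batch_size      # assumption: same on every worker
        self.every_n_steps = every_n_steps
        self._clock = clock
        self._emit = emit
        self._last_step = None
        self._last_time = None

    def after_step(self, global_step, loss):
        """Call after each training step; returns the logged line or None."""
        if not self.is_chief:
            return None  # workers stay silent
        now = self._clock()
        if self._last_step is None:
            # First call only records a baseline; nothing to report yet.
            self._last_step, self._last_time = global_step, now
            return None
        steps = global_step - self._last_step
        if steps < self.every_n_steps:
            return None
        elapsed = now - self._last_time
        sec_per_step = elapsed / steps
        # global_step advances with every worker's update, so this is
        # cluster-wide throughput under the equal-batch-size assumption.
        if elapsed > 0:
            examples_per_sec = steps * self.batch_size / elapsed
        else:
            examples_per_sec = float("inf")
        line = (f"global_step={global_step} loss={loss:.4f} "
                f"sec/step={sec_per_step:.3f} "
                f"examples/sec={examples_per_sec:.1f}")
        self._emit(line)
        self._last_step, self._last_time = global_step, now
        return line
```

Because the interval is measured between global-step values, the reported rate stays meaningful across worker restarts; per-worker timing would still need a separate, clearly labelled line if it turns out to be useful for debugging stragglers.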