tensorlab / tensorfx

TensorFlow framework for training and serving machine learning models
Apache License 2.0

Hook update #13

Closed brandondutra closed 7 years ago

brandondutra commented 7 years ago

Basic change: master-only and worker-only hooks. Works well locally. Example of a local run:

INFO: Global steps: 1000; Duration: 1 sec; Throughput: 3643.1 instances/sec; Loss: 0.492
INFO: Global steps: 1100; Duration: 1 sec; Throughput: 3807.0 instances/sec; Loss: 0.509
INFO: Global steps: 1172; Evaluation metric: 0.800
INFO: Global steps: 1200; Duration: 1 sec; Throughput: 3175.7 instances/sec; Loss: 0.289
INFO: Global steps: 1300; Duration: 1 sec; Throughput: 3329.1 instances/sec; Loss: 0.387

Things I'm still not happy with:

1) What actionable information do I get from the worker logs? What does it mean if time per step is really high or throughput is low? In the cloud run below, what does the log say about my model, or about what I should do? I don't see the story in the worker logs. Also, if a worker dies and restarts, I think its counter will start at 0, so its Steps value is detached from the global step.

I Run: 0.00 sec; Steps: 19100; Duration: 194 sec; Throughput: 491.5 instances/sec
I Run: 0.00 sec; Steps: 24200; Duration: 203 sec; Throughput: 595.5 instances/sec
I Run: 0.00 sec; Steps: 21200; Duration: 208 sec; Throughput: 508.3 instances/sec
I Run: 0.00 sec; Steps: 18000; Duration: 207 sec; Throughput: 434.1 instances/sec
I Run: 0.01 sec; Steps: 19200; Duration: 195 sec; Throughput: 491.9 instances/sec
I Run: 0.00 sec; Steps: 24300; Duration: 203 sec; Throughput: 595.8 instances/sec
I Run: 0.00 sec; Steps: 21300; Duration: 209 sec; Throughput: 508.8 instances/sec
I Run: 0.00 sec; Steps: 18100; Duration: 208 sec; Throughput: 434.9 instances/sec
I Run: 0.00 sec; Steps: 19300; Duration: 196 sec; Throughput: 492.3 instances/sec
I Run: 0.02 sec; Steps: 24400; Duration: 204 sec; Throughput: 596.0 instances/sec
I Run: 0.03 sec; Steps: 21400; Duration: 210 sec; Throughput: 509.2 instances/sec
I Run: 0.00 sec; Steps: 18200; Duration: 208 sec; Throughput: 435.6 instances/sec
I Run: 0.00 sec; Steps: 24500; Duration: 205 sec; Throughput: 596.1 instances/sec
I Run: 0.00 sec; Steps: 19400; Duration: 196 sec; Throughput: 492.7 instances/sec
I Global steps: 76706; Evaluation metric: 0.867
I Run: 0.00 sec; Steps: 21500; Duration: 211 sec; Throughput: 509.4 instances/sec
I Global steps: 84402; Duration: 207 sec; Throughput: 2038.2 instances/sec; Loss: 0.000
I Run: 0.00 sec; Steps: 18300; Duration: 209 sec; Throughput: 436.2 instances/sec
I Run: 0.00 sec; Steps: 24600; Duration: 206 sec; Throughput: 596.1 instances/sec
I Run: 0.01 sec; Steps: 19500; Duration: 197 sec; Throughput: 493.0 instances/sec
I Run: 0.00 sec; Steps: 21600; Duration: 211 sec; Throughput: 509.9 instances/sec
I Run: 0.02 sec; Steps: 18400; Duration: 210 sec; Throughput: 437.0 instances/sec
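
One possible way to address the restart concern in (1) is to key the log on the shared global step rather than on a per-worker counter. The sketch below is plain Python, not the hook in this PR: StepLogger, after_step, batch_size, every_n_steps and clock are all hypothetical names, and the caller is assumed to pass in the global step value it read after each training step.

```python
import time


class StepLogger(object):
    """Hypothetical helper: reports throughput from the shared global step.

    Because it tracks differences in the global step instead of counting
    local calls, a restarted worker resumes from the current global step
    rather than from 0.
    """

    def __init__(self, batch_size, every_n_steps=100, clock=time.time):
        self._batch_size = batch_size        # instances per step (assumed fixed)
        self._every_n_steps = every_n_steps  # minimum step gap between reports
        self._clock = clock                  # injectable so it can be tested
        self._last_step = None
        self._last_time = None

    def after_step(self, global_step):
        """Returns a log line when enough steps have passed, else None."""
        now = self._clock()
        if self._last_step is None:
            # First call (also the first call after a restart): set the baseline.
            self._last_step, self._last_time = global_step, now
            return None
        steps = global_step - self._last_step
        if steps < self._every_n_steps:
            return None
        elapsed = max(now - self._last_time, 1e-9)  # guard against a zero interval
        throughput = steps * self._batch_size / elapsed
        self._last_step, self._last_time = global_step, now
        return 'Global steps: %d; Throughput: %.1f instances/sec' % (
            global_step, throughput)
```

A caveat with this approach: every worker advances the global step, so the number it yields is cluster-wide throughput, not per-worker throughput. In the cloud log above, the four worker throughputs add up to roughly the master's 2038.2 instances/sec, which is consistent with that.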

2) Printing loss or eval stats is useful. Maybe only the master should print, and the workers should print nothing?
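
As a rough illustration of option (2), hook selection could be driven by the task's role. This is only a sketch: hooks_for is a hypothetical function, the hook names are placeholder strings, and the 'master' / 'worker' / 'ps' task types are assumed to follow the usual TF_CONFIG convention.

```python
def hooks_for(task_type, master_hooks, worker_hooks):
    """Hypothetical selector: returns the hooks a task of this role should run."""
    if task_type == 'master':
        # Master prints loss and evaluation stats.
        return list(master_hooks)
    if task_type == 'worker':
        # Pass an empty list here to keep workers silent.
        return list(worker_hooks)
    # Other roles (for example parameter servers) run no training hooks.
    return []


# Placeholder hook names, for illustration only.
master_only = ['log_loss', 'log_eval_metric']
print(hooks_for('master', master_only, []))
print(hooks_for('worker', master_only, []))
```

With an empty worker list, the only lines left in the cloud log would be the "Global steps" ones from the master.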

brandondutra commented 7 years ago

@nikhilk

brandondutra commented 7 years ago

Closing this PR, as the history is all messed up after this weekend's updates. Will make a new PR.