msr-fiddle / pipedream


Translation demo: Division by zero #49

Open sergei-mironov opened 4 years ago

sergei-mironov commented 4 years ago

I'm experiencing the error below, which looks critical. I'm using revision f50827f with the Docker base image nvcr.io/nvidia/pytorch:19.05-py3:

Traceback (most recent call last):
  File "train.py", line 474, in <module>
    main()
  File "train.py", line 458, in main
    train_loss, train_perf = trainer.optimize(train_loader)
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 373, in optimize
    output = self.feed_data(data_loader, training=True)
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 330, in feed_data
    os.path.join("profiles", self.arch+'_2'))
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 42, in create_graph
    graph_creator.persist_graph(directory)
  File "../torchmodules/torchgraph/graph_creator.py", line 281, in persist_graph
    self.graph.render_bar_graphs_and_cdfs(directory)
  File "../../graph/graph.py", line 607, in render_bar_graphs_and_cdfs
    pdfs.append(((node.forward_compute_time + node.backward_compute_time) / (cdfs[-1][0] / 100.0),
ZeroDivisionError: float division by zero
sergei-mironov commented 4 years ago

Note: printing cdfs just before line 603 shows the following:

[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 11010048.0, 151617536.0], [0.0, 22544384.0, 303235072.0], [0.0, 45613056.0, 370409472.0], [0.0, 68681728.0, 370409472.0], [0.0, 80740352.0, 420773888.0], [0.0, 81002496.0, 420773888.0], [0.0, 92536832.0, 420773888.0], [0.0, 104071168.0, 420773888.0], [0.0, 116129792.0, 454361088.0], [0.0, 116391936.0, 454361088.0], [0.0, 127926272.0, 454361088.0], [0.0, 139460608.0, 454361088.0], [0.0, 150994944.0, 454361088.0], [0.0, 163053568.0, 487948288.0], [0.0, 163315712.0, 487948288.0], [0.0, 174850048.0, 487948288.0], [0.0, 186384384.0, 487948288.0], [0.0, 209401856.0, 529928192.0], [0.0, 220411904.0, 529928192.0], [0.0, 220674048.0, 529928192.0], [0.0, 220936192.0, 529928192.0], [0.0, 231946240.0, 529928192.0], [0.0, 242956288.0, 529928192.0], [0.0, 253966336.0, 529928192.0], [0.0, 265500672.0, 580292608.0], [0.0, 265762816.0, 580292608.0], [0.0, 276772864.0, 580292608.0], [0.0, 287782912.0, 580292608.0], [0.0, 298792960.0, 580292608.0], [0.0, 310327296.0, 630657024.0], [0.0, 310589440.0, 630657024.0], [0.0, 321599488.0, 630657024.0], [0.0, 332609536.0, 630657024.0], [0.0, 343619584.0, 630657024.0], [0.0, 354629632.0, 630657024.0], [0.0, 366163968.0, 681021440.0], [0.0, 366426112.0, 681021440.0], [0.0, 377436160.0, 681021440.0], [0.0, 388446208.0, 681021440.0], [0.0, 786442240.0, 832787040.0]]
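Each cdfs entry above appears to be a cumulative [compute_time, activation_size, parameter_size] triple per node (that column interpretation is my assumption, not confirmed from the source), and the first column is identically zero. A minimal sketch that reproduces the failing arithmetic from graph.py line 607 using just the first and last rows of the printout:

```python
# Illustrative reproduction only -- not PipeDream code. The column meanings
# are assumed from the printout above.
cdfs = [[0.0, 0.0, 0.0], [0.0, 786442240.0, 832787040.0]]

forward_compute_time = 0.0   # every profiled operator time is zero
backward_compute_time = 0.0

# graph.py line 607 evaluates, per node:
#   (forward + backward) / (cdfs[-1][0] / 100.0)
# With cdfs[-1][0] == 0.0 the divisor is 0.0 and Python raises.
try:
    pdf = (forward_compute_time + backward_compute_time) / (cdfs[-1][0] / 100.0)
except ZeroDivisionError as err:
    print(f"reproduced: {err}")  # float division by zero
```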
deepakn94 commented 4 years ago

To work around this particular error, you can just comment out the render_bar_graphs_and_cdfs() method call. I am worried that the root cause is something else, though -- it looks like the compute times for every operator are all zero?
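If you would rather keep the call than comment it out, a guard on the cumulative compute time also avoids the crash while keeping the real problem visible. This is only a sketch under the assumption that the divided expression from the traceback is computed in a loop over nodes; the function name and structure here are hypothetical, not the actual graph.py code:

```python
# Hypothetical guard -- only the divided expression comes from the traceback.
def compute_pdfs(nodes, cdfs):
    total_compute_time = cdfs[-1][0]
    if total_compute_time == 0.0:
        # All profiled compute times are zero; skip rendering and warn,
        # since the underlying issue is in the profiling step itself.
        print("Warning: total compute time is zero; skipping bar graphs/CDFs")
        return []
    return [
        (node.forward_compute_time + node.backward_compute_time)
        / (total_compute_time / 100.0)
        for node in nodes
    ]
```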

sergei-mironov commented 4 years ago

I did some debugging, and my guess is that the assignments at line 324 are bypassed by the 'continue' branches above them. summary_elem could then still contain the _time fields from the zero-initialization at the beginning of training. Could you please check?
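For clarity, here is a schematic of the suspected pattern. It is purely illustrative -- the names, condition, and structure are invented and do not come from the profiler source -- but it shows how zero-initialized _time fields would survive if the assignment is skipped by a continue:

```python
# Purely illustrative sketch of the hypothesis above -- not PipeDream code.
summary = []

for op in ["Embedding", "LSTM", "Linear"]:
    # Zero-initialized entry created at the start of training.
    summary_elem = {"layer_name": op, "forward_time": 0.0, "backward_time": 0.0}
    summary.append(summary_elem)

    hit_continue_branch = True  # stand-in for whatever condition applies
    if hit_continue_branch:
        continue  # the assignments below are bypassed...

    summary_elem["forward_time"] = 1.23   # ...so measured times are never stored
    summary_elem["backward_time"] = 2.34

print(summary)  # every *_time field is still 0.0
```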

heguangxin commented 2 years ago

> I did some debugging, and my guess is that the assignments at line 324 are bypassed by the 'continue' branches above them. summary_elem could then still contain the _time fields from the zero-initialization at the beginning of training. Could you please check?

I got the same problem: the times for each operator are all zero. Did you figure it out?