iHeartGraph / Euler


RuntimeError: requires_grad not set on root #3

Closed luisfredgs closed 2 years ago

luisfredgs commented 2 years ago

Hello. Thank you for sharing your work with us. I'm getting an error when I run with the parameter "-i PRED" on the LANL dataset. I guess the error "RuntimeError: requires_grad not set on root" suggests we should set requires_grad=True in euler_predictor.py when calculating the loss, but I'm not sure. Can you help me?

The command: $ python run.py -t 5 -d 0.5 -e SAGE -r LSTM -i PRED

The output and error:

Namespace(dataset='LANL', delta=0.5, encoder='SAGE', fpweight=0.6, hidden=32, impl='PRED', load=False, lr=0.005, ngrus=1, nowrite=False, patience=5, rnn='LSTM', te_end=None, tests=5, threads=1, workers=8, zdim=16)
SAGE -> LSTM (PRED)
{'tr_start': 0, 'tr_end': 3, 'val_start': 3, 'val_end': 42, 'te_times': [(3, 2557047)], 'delta': 1800}
Tasks: [1]
worker0 loading 0-3
0 0 400000
Finding start: 0it [00:01, ?it/s]
Seconds read: 151035it [00:01, 98226.03it/s]
worker0 is head
forward
backward
[W tensorpipe_agent.cpp:682] RPC agent for worker0 encountered error when reading incoming request from master: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
Traceback (most recent call last):
  File "run.py", line 201, in <module>
    stats = [
  File "run.py", line 202, in <listcomp>
    run_all(
  File "/home/luisfredgs/Documents/AA/BB/EULER/lanl_experiments/spinup.py", line 535, in run_all
    mp.spawn(
  File "/home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/luisfredgs/Documents/AA/BB/EULER/lanl_experiments/spinup.py", line 183, in init_procs
    model, h0, tpe = train(rrefs, tr_args, rnn_constructor, rnn_args, impl)
  File "/home/luisfredgs/Documents/AA/BB/EULER/lanl_experiments/spinup.py", line 250, in train
    dist_autograd.backward(context_id, loss)
RuntimeError: requires_grad not set on root
Exception raised from validateRootsAndRetrieveEdges at ../torch/csrc/distributed/autograd/engine/dist_engine.cpp:159 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4f95d0bf72 in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7f4f95d086bf in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch::distributed::autograd::DistEngine::validateRootsAndRetrieveEdges(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&) + 0x28f (0x7f4f67b4675f in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::distributed::autograd::DistEngine::execute(long, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool) + 0xad (0x7f4f67b497fd in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::distributed::autograd::backward(long, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool) + 0x177 (0x7f4f67b31ee7 in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x7fefff (0x7f4f930cbfff in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x204d52 (0x7f4f92ad1d52 in /home/luisfredgs/venvs/Euler/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: PyCFunction_Call + 0x57 (0x521df7 in /home/luisfredgs/venvs/Euler/bin/python)
frame #8: _PyObject_MakeTpCall + 0x313 (0x50c503 in /home/luisfredgs/venvs/Euler/bin/python)
frame #9: _PyEval_EvalFrameDefault + 0x4ff4 (0x508384 in /home/luisfredgs/venvs/Euler/bin/python)
frame #10: _PyFunction_Vectorcall + 0x10f (0x51434f in /home/luisfredgs/venvs/Euler/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x3a3 (0x503733 in /home/luisfredgs/venvs/Euler/bin/python)
frame #12: _PyFunction_Vectorcall + 0x10f (0x51434f in /home/luisfredgs/venvs/Euler/bin/python)
frame #13: PyObject_Call + 0x25a (0x52410a in /home/luisfredgs/venvs/Euler/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x234f (0x5056df in /home/luisfredgs/venvs/Euler/bin/python)
frame #15: _PyFunction_Vectorcall + 0x10f (0x51434f in /home/luisfredgs/venvs/Euler/bin/python)
frame #16: PyObject_Call + 0x25a (0x52410a in /home/luisfredgs/venvs/Euler/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x234f (0x5056df in /home/luisfredgs/venvs/Euler/bin/python)
frame #18: _PyFunction_Vectorcall + 0x10f (0x51434f in /home/luisfredgs/venvs/Euler/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x6c6 (0x503a56 in /home/luisfredgs/venvs/Euler/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x2fa (0x5021ca in /home/luisfredgs/venvs/Euler/bin/python)
frame #21: _PyFunction_Vectorcall + 0x1ad (0x5143ed in /home/luisfredgs/venvs/Euler/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x6c6 (0x503a56 in /home/luisfredgs/venvs/Euler/bin/python)
frame #23: _PyFunction_Vectorcall + 0x10f (0x51434f in /home/luisfredgs/venvs/Euler/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x3a3 (0x503733 in /home/luisfredgs/venvs/Euler/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x2fa (0x5021ca in /home/luisfredgs/venvs/Euler/bin/python)
frame #26: _PyFunction_Vectorcall + 0x1ad (0x5143ed in /home/luisfredgs/venvs/Euler/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x117d (0x50450d in /home/luisfredgs/venvs/Euler/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2fa (0x5021ca in /home/luisfredgs/venvs/Euler/bin/python)
frame #29: PyEval_EvalCode + 0x27 (0x5d6147 in /home/luisfredgs/venvs/Euler/bin/python)
frame #30: /home/luisfredgs/venvs/Euler/bin/python() [0x5f6f55]
frame #31: /home/luisfredgs/venvs/Euler/bin/python() [0x5f5f53]
frame #32: PyRun_StringFlags + 0x7f (0x5f2d7f in /home/luisfredgs/venvs/Euler/bin/python)
frame #33: PyRun_SimpleStringFlags + 0x3f (0x455fc5 in /home/luisfredgs/venvs/Euler/bin/python)
frame #34: Py_RunMain + 0x3c1 (0x5f1f41 in /home/luisfredgs/venvs/Euler/bin/python)
frame #35: Py_BytesMain + 0x2d (0x5ca1ed in /home/luisfredgs/venvs/Euler/bin/python)
frame #36: <unknown function> + 0x29d90 (0x7f4fa4395d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: __libc_start_main + 0x80 (0x7f4fa4395e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: _start + 0x25 (0x5ca0e5 in /home/luisfredgs/venvs/Euler/bin/python)

My environment:

zazyzaya commented 2 years ago

Hey Luis, I see you marked this as closed. Could you post the fix in case others face the same issue?

luisfredgs commented 2 years ago

I made a mistake while customizing load_lanl.py, and that caused the bug. Restoring the original code fixed it :-).
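
For anyone who lands here with the same message: the failing check is in validateRootsAndRetrieveEdges (see the traceback), and it fires when a tensor passed to dist_autograd.backward has requires_grad=False. Setting requires_grad=True on the loss by hand would probably only silence the check, since a loss with no graph behind it has nothing to backpropagate into. The thread does not say what the load_lanl.py customization changed, so the sketch below is only a generic guard. check_backward_root is a hypothetical helper, not part of Euler, and it assumes nothing beyond the requires_grad attribute and numel() method that torch.Tensor provides.

```python
def check_backward_root(loss, name="loss"):
    """Raise a descriptive error if `loss` cannot serve as an autograd root.

    Hypothetical helper, not part of Euler. It reads only requires_grad and
    numel(), both of which torch.Tensor exposes.
    """
    # A root without requires_grad is not connected to any trainable
    # parameter, so a backward pass from it would have nothing to update.
    if not getattr(loss, "requires_grad", False):
        raise ValueError(
            f"{name} has requires_grad=False: it is detached from the autograd "
            "graph. Check that the data loader returned usable inputs and that "
            "the forward pass was not run under torch.no_grad()."
        )
    # Backward roots are expected to be scalars.
    if loss.numel() != 1:
        raise ValueError(f"{name} must be a scalar, got {loss.numel()} elements")
    return loss


# Intended use just before the call shown in the traceback (spinup.py, train),
# applied to each root tensor handed to dist_autograd.backward:
#     check_backward_root(t)
```

If the guard trips, the thing to inspect is whatever feeds the forward pass (here, the output of the modified loader) rather than the loss computation itself.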