rusty1s / pytorch_scatter

PyTorch Extension Library of Optimized Scatter Operations
https://pytorch-scatter.readthedocs.io
MIT License

RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 1th output. #178

Closed. matteoTaiana closed this issue 3 years ago.

matteoTaiana commented 3 years ago

Hi everyone,

I am using the scatter_mean() function to update the edge embeddings in a Graph Neural Network (the GNN is implemented with PyTorch Geometric). Training runs fine for a varying number of epochs and then fails with an error. To make the error reporting more informative, I instrumented the code with torch.autograd.set_detect_anomaly(True).
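For completeness, this is roughly how the flag is set (a minimal standalone sketch, not my actual training loop; the context-manager form can also be used to scope it to a single backward pass):

```python
# Minimal sketch: enabling anomaly detection globally, or scoping it
# to one backward pass with the context manager.
import torch

torch.autograd.set_detect_anomaly(True)          # global switch, as used above

# Equivalent, scoped form:
with torch.autograd.detect_anomaly():
    x = torch.randn(3, requires_grad=True)
    (x ** 2).sum().backward()
```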

This is the error I get:

  File "/home/matteo/Code/PoseRefiner0/main.py", line 91, in <module>
    def main(_run, n_epochs, learning_rate, training_batch_size, perform_node_updates, optimizer_function, use_lr_scheduler,
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
    self.run_commandline()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/main.py", line 156, in main
    output = model(data)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/pose_refiner.py", line 37, in forward
    x, edge_attr = self.msg_passer(x=x,                  # NOTE: This is different on purpose.
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 47, in forward
    updated_edge_attr = self.update_edges(x=x, edge_index=edge_index, edge_attr=edge_attr,
  File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 132, in update_edges
    updated_edge_attr[cum_edges[g_id]:cum_edges[g_id+1], :] = scatter_mean(single_updates, dim=0, index=b)
 (function print_stack)
ERROR - PoseRefiner - Failed after 0:39:09!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/matteo/Code/PoseRefiner0/main.py", line 173, in main
    total_loss.backward()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 1th output.

In summary, this is the error: RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 1th output.

And this is the instruction (executed during the forward pass) that triggers the error during backward(): updated_edge_attr[cum_edges[g_id]:cum_edges[g_id+1], :] = scatter_mean(single_updates, dim=0, index=b)

I don't understand the error message. The error happens while running the backward function, i.e. while computing gradients. The CopySlices function seems to simply select which part of the output tensor the result of scatter_mean() is copied into, so the local gradient should be 1. Could this be caused by the scatter_mean() function? Could it be caused by my writing to updated_edge_attr several times? Is that an in-place operation?
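A tiny standalone check with synthetic tensors (not my model) does suggest that slice assignment is recorded as an in-place op, with a CopySlices node as the result's grad_fn, though I am not sure whether that is what produces the nan:

```python
# Tiny illustration with synthetic tensors: writing into a slice of a tensor
# is an in-place operation, and autograd records it as a CopySlices node.
import torch

out = torch.zeros(4, 2)
src = torch.randn(2, 2, requires_grad=True)
out[0:2, :] = src          # in-place slice assignment
print(out.grad_fn)         # typically prints a CopySlices node

out.sum().backward()
print(src.grad)            # gradient flows back through the copied slice
```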

Thank you in advance for your help!

rusty1s commented 3 years ago

That's actually hard for me to track down. My guess is that it is caused by modifying updated_edge_attr in-place. To test this, you can replace the slice assignment with torch.index_put.
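Roughly, the swap could look like this (a minimal sketch with synthetic shapes, using the out-of-place Tensor.index_put variant rather than index_put_, since the point is to avoid the in-place write; start/stop stand in for cum_edges[g_id] and cum_edges[g_id+1]):

```python
# Hedged sketch with synthetic shapes (not the original update_edges()):
# replace the in-place slice assignment with the out-of-place index_put,
# which returns a new tensor instead of writing into updated_edge_attr.
import torch
from torch_scatter import scatter_mean

num_edges, feat_dim = 10, 4
updated_edge_attr = torch.zeros(num_edges, feat_dim)

single_updates = torch.randn(6, feat_dim, requires_grad=True)
b = torch.tensor([0, 0, 1, 1, 2, 2])        # scatter index for one graph
start, stop = 3, 6                          # stands in for cum_edges[g_id]:cum_edges[g_id + 1]

per_graph = scatter_mean(single_updates, index=b, dim=0)   # shape [3, feat_dim]

# In-place version from the issue:
#   updated_edge_attr[start:stop, :] = per_graph
# Out-of-place alternative:
rows = torch.arange(start, stop)
updated_edge_attr = updated_edge_attr.index_put((rows,), per_graph)

updated_edge_attr.sum().backward()
print(single_updates.grad.shape)            # torch.Size([6, 4])
```

Alternatively, you can collect the per-graph scatter_mean results in a Python list and torch.cat them at the end, which avoids writing into a pre-allocated buffer altogether.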

github-actions[bot] commented 3 years ago

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity.