pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

device of 'edge_index' is changed when 'edge_index' is empty #629

Closed ZongweiZhou1 closed 5 years ago

ZongweiZhou1 commented 5 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torch
from torch_geometric.nn import EdgeConv
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)

class TestBlock(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock, self).__init__()
        self.edge_conv1 = EdgeConv(MLP(F_in, F_out), aggr='mean')
        self.edge_conv2 = EdgeConv(MLP(F_out, F_out), aggr='mean')
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index 
        x = self.edge_conv1(x, edge_index)
        x = self.fc(self.edge_conv2(x, edge_index))
        return x

d = torch.device('cuda:1')
block = TestBlock(4, 8).to(d)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-d1a9b2a113f2> in <module>
      1 edge_index = torch.tensor([[], []], dtype=torch.long)
      2 data = Data(x=x, edge_index=edge_index)
----> 3 y = block(data.to(d))
      4 print(y)
      5 print(data.edge_index.data)

~/miniconda3/envs/py36_pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

<ipython-input-31-b450444a54a6> in forward(self, data)
     24         edge_index = data.edge_index
     25         x = self.edge_conv1(x, edge_index)
---> 26         x = self.fc(self.edge_conv2(x, edge_index))
     27         return x

~/miniconda3/envs/py36_pytorch110/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

~/miniconda3/envs/py36_pytorch110/lib/python3.6/site-packages/torch_geometric/nn/conv/edge_conv.py in forward(self, x, edge_index)
     42         x = x.unsqueeze(-1) if x.dim() == 1 else x
     43 
---> 44         return self.propagate(edge_index, x=x)
     45 
     46     def message(self, x_i, x_j):

~/miniconda3/envs/py36_pytorch110/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py in propagate(self, edge_index, size, **kwargs)
     99                         raise ValueError(__size_error_msg__)
    100 
--> 101                     tmp = torch.index_select(tmp, 0, edge_index[idx])
    102                     message_args.append(tmp)
    103             else:

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/generic/THCTensorIndex.cu:521

## Expected behavior

The device of `edge_index` should remain `cuda:1` even when `edge_index` is empty, and the forward pass should not crash.

## Environment

* OS:   Ubuntu 16.04
* Python version:  3.6
* PyTorch version: 1.1.0
* CUDA/cuDNN version:  9.0
* GCC version:  5.4.0
* Any other relevant information:

## Additional context

Maybe there is a problem in initialization. I found something interesting in the function `gen` in `torch_scatter/utils/gen.py`:

def gen(src, index, dim=-1, out=None, dim_size=None, fill_value=0):
    dim = range(src.dim())[dim]  # Get real dim value.

    # Automatically expand index tensor to the right dimensions.
    if index.dim() == 1:
        index_size = list(repeat(1, src.dim()))
        index_size[dim] = src.size(dim)
        index = index.view(index_size).expand_as(src)

    # Generate output tensor if not given.
    if out is None:
        out_size = list(src.size())
        dim_size = maybe_dim_size(index, dim_size)
        out_size[dim] = dim_size
        out = src.new_full(out_size, fill_value)

    return src, out, index, dim

The device of `index.data` is `cuda:1` before `index.view(index_size)`, but after that call it changes to `cuda:0`, while the device of `index` itself is still reported as `cuda:1`.
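
The behavior can be reduced to the view()/expand_as() call on a zero-element CUDA tensor. A minimal sketch of that reduction (hypothetical repro; it needs at least two GPUs):

import torch

# Mirrors the expansion in torch_scatter/utils/gen.py for an empty edge_index:
# src has zero rows (no messages), index has zero elements.
src = torch.randn(0, 8, device='cuda:1')
index = torch.empty(0, dtype=torch.long, device='cuda:1')

index = index.view(0, 1).expand_as(src)  # view to (0, 1), then expand to (0, 8)
print(index.device)       # cuda:1
print(index.data.device)  # observed above to come back as cuda:0 on PyTorch 1.1.0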

By the way, if the `edge_index` of the first sample fed to the network is not empty, the bug is gone:

edge_index = torch.tensor([[0], [1]], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d))

edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d))
print(y.device)
print(data.edge_index.data)


The output:

cuda:1
tensor([], device='cuda:1', size=(2, 0), dtype=torch.int64)

rusty1s commented 5 years ago

Thanks for reporting. I will look into it, but it may take a while since I currently do not have access to multiple GPUs. I am wondering why the view call should have any effect on the device of the tensor.

WMF1997 commented 5 years ago

hello @ZongweiZhou1 @rusty1s

Conclusion: perhaps it is stacking multiple layers (of EdgeConv) that causes it...

Multi-layer GCNConv: OK. Multi-layer EdgeConv: fails.

CPU and single GPU: OK (has been tested, all right).

My long experiments and long results...

I tried to re-run the code provided by @ZongweiZhou1. Here is the code (since he does not mention what x is in his code, I add x here).

First, my environment:

* OS: Ubuntu 16.04
* Python: 3.6.4
* PyTorch: 1.0.1.post2
* PyG: 1.0.3

The multi-NVIDIA-GPU setup:

[screenshot: nvidia-smi output of a 3-GPU workstation]

import torch
from torch_geometric.nn import EdgeConv
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)

class TestBlock(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock, self).__init__()
        self.edge_conv1 = EdgeConv(MLP(F_in, F_out), aggr='mean')
        self.edge_conv2 = EdgeConv(MLP(F_out, F_out), aggr='mean')
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index 
        x = self.edge_conv1(x, edge_index)
        x = self.fc(self.edge_conv2(x, edge_index))
        return x

d_cpu = torch.device('cpu')
d_cu1 = torch.device('cuda:1') 
d_cu2 = torch.device('cuda:2')

block = TestBlock(4, 8).to(d_cu1)
x = torch.randn(50, 4)  # 4 input features, matching TestBlock(4, 8)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print (y)
print (y.shape)

Show the result:

1. Run via ipython (not as a python program):

[screenshot]

All right.

2. Run via python, i.e. python pyg_test.py:

[screenshot]

It still works fine here.

What if I change the GPU?

In my code above I used d_cu2 (sorry, not d_cu0: the device cuda:0 is being used by another program which I cannot stop).

[screenshot]

Very nice.

And of course, on a 3-GPU workstation, the data and model cannot be sent to a 4th cuda device.

Sorry... I have to attend a group meeting at 09:30 (UTC+8); further experiments on this issue will be run over remote SSH.

Oops, same problem with: 1. PyTorch 1.1.0, 2. PyG 1.3.0, and all other libraries at their newest versions.

[screenshot]

New things (note: this time I used MobaXterm to connect to that server):

Ablation: I know that modifying either the .py or the .cu files may take a relatively long time. Before that, I think it is worth finding out what causes the problem (for example, whether it is a problem of PyTorch rather than PyG).

Show the code here.

First, I commented out the MLP part (showing that the issue appears with plain PyG modules, without a custom nn.Module):

[screenshot]

Second: sorry, I noticed that the second assignment cascades mlp and edge_conv, i.e.

x = self.fc(self.edge_conv2(x, edge_index))

I used just self.fc, without self.edge_conv2:

[screenshot]

It works.

Third (perhaps finally): after trying all kinds of code permutations (styles), the last step is to split the nested call self.fc(self.edge_conv2(...)) into separate statements.

My final code:

# pyg_test_device.py
# Other nn or Other devices. (for example, JUST cuda:0)

import torch
from torch_geometric.nn import EdgeConv

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

d_cpu = torch.device('cpu')

d_cu0 = torch.device('cuda:0')
d_cu1 = torch.device('cuda:1') 
d_cu2 = torch.device('cuda:2')
d_cu3 = torch.device('cuda:3')

class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)

class TestBlock(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock, self).__init__()
        self.mlp1 = MLP(F_in, F_out)
        self.edge_conv1 = EdgeConv(self.mlp1, aggr='mean')
        # self.edge_conv1 = EdgeConv(MLP(F_in, F_out),aggr='mean')

        # comment this, and it works. 
        # self.mlp2 = MLP(F_out, F_out)
        # self.edge_conv2 = EdgeConv(self.mlp2, aggr='mean')

        # self.edge_conv2 = EdgeConv(MLP(F_out, F_out), aggr='mean')         
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index 
        x = self.edge_conv1(x, edge_index)
        # x = self.fc(x)
        # x = self.edge_conv2(x, edge_index)
        x = self.fc(x)
        return x

# devices

block = TestBlock(4, 22).to(d_cu1)
x = torch.randn(50, 4)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print (y)
print (y.shape)

[screenshot]

Finally... same as 1.

The init format of EdgeConv is different from that of GCNConv (and many other GNN layers), since EdgeConv needs a user-defined nn module passed to its constructor.
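
For reference, a minimal sketch of that difference (standard PyG API):

import torch.nn as nn
from torch_geometric.nn import EdgeConv, GCNConv

# GCNConv takes channel sizes; EdgeConv wraps a user-defined nn module
# that maps 2 * F_in features (the concatenated edge features) to F_out.
gcn = GCNConv(4, 8)
edge = EdgeConv(nn.Linear(2 * 4, 8), aggr='mean')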

Code (adding GCNConv):


# pyg_test_device.py
# Other nn or Other devices. (for example, JUST cuda:0)

import torch
from torch_geometric.nn import EdgeConv, GCNConv

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

# devices
d_cpu = torch.device('cpu')

d_cu0 = torch.device('cuda:0')
d_cu1 = torch.device('cuda:1') 
d_cu2 = torch.device('cuda:2')
d_cu3 = torch.device('cuda:3')

class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)

class TestBlock_edge(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock_edge, self).__init__()
        self.mlp1 = MLP(F_in, F_out)
        self.edge_conv1 = EdgeConv(self.mlp1, aggr='mean')

        # self.edge_conv1 = EdgeConv(MLP(F_in, F_out),aggr='mean')
        # comment this, and it works. 

        # self.mlp2 = MLP(F_out, F_out)
        # self.edge_conv2 = EdgeConv(self.mlp2, aggr='mean')

        # self.edge_conv2 = EdgeConv(MLP(F_out, F_out), aggr='mean')         
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index 
        x = self.edge_conv1(x, edge_index)
        # x = self.fc(x)
        # x = self.edge_conv2(x, edge_index)
        x = self.fc(x)
        return x

class TestBlock_gcn(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock_gcn, self).__init__()
        self.gcn1 = GCNConv(F_in, F_out)
        self.gcn2 = GCNConv(F_out, F_out)

        self.fc = nn.Linear(F_out, 30)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.gcn1(x, edge_index)
        x = self.gcn2(x, edge_index)
        x = self.fc(x)
        return x

block = TestBlock_gcn(4, 22).to(d_cu1)
x = torch.randn(50, 4)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print (y)
print (y.shape)

[screenshot]

GCNConv: all right.

Sorry, it is too long. I think my reply may help you... a little.

I will look into edge_conv and try to hack a way of using multi-layer EdgeConv on multiple GPUs (cuda:1, ...).

yours sincerely: @wmf1997

WMF1997 commented 5 years ago

P.S. @ZongweiZhou1: most of the time, edge_index will not be [[],[]] (i.e., will not be empty) when testing on meaningful datasets or with a random edge_index.

Indeed, after adding something to the null edge_index, it works (no error, and the device shown is cuda:1).

yours sincerely: @WMF1997

ZongweiZhou1 commented 5 years ago

@WMF1997 @rusty1s

Thanks so much for this long debugging process. :)

I also find the problem is not with PyTorch. Instead, the device of the data of an empty edge_index changes after it is fed into the EdgeConv, when it is the first sample given to the EdgeConv. For example:

import torch
from torch_geometric.nn import EdgeConv

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

d1 = torch.device('cuda:1')
net = EdgeConv(nn.Linear(2*2,4), aggr='mean').to(d1)

x = torch.rand(4,2)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)

print('the devices of edge_index before fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index before fed into the edgeconv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('the devices of edge_index after fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index after fed into the edgeconv: {}'.format(data.edge_index.data.device))

The outputs are:

[screenshot]

If the edge_index of the first sample sent to the EdgeConv is not empty, while that of the second sample is empty, the bug is gone:

import torch
from torch_geometric.nn import EdgeConv

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data

d1 = torch.device('cuda:1')
net = EdgeConv(nn.Linear(2*2,4), aggr='mean').to(d1)

# first inference
x = torch.rand(4,2)
edge_index = torch.tensor([[0], [1]], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)
print('the devices of edge_index before fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index before fed into the edgeconv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('the devices of edge_index after fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index after fed into the edgeconv: {}'.format(data.edge_index.data.device))

# second inference
x = torch.rand(4,2)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)
print('\nthe devices of edge_index before fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index before fed into the edgeconv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('the devices of edge_index after fed into the edgeconv: {}'.format(data.edge_index.device))
print('the devices of the data of edge_index after fed into the edgeconv: {}'.format(data.edge_index.data.device))

The output is:

[screenshot]

By the way, could this be a problem of the PyG version? My PyG version is 1.3.0. As you posted, it works fine with 1.0.3?

ZongweiZhou1 commented 5 years ago

P.S. @ZongweiZhou1: most of the time, edge_index will not be [[],[]] (i.e., will not be empty) when testing on meaningful datasets or with a random edge_index.

Indeed, after adding something to the null edge_index, it works (no error, and the device shown is cuda:1).

yours sincerely: @WMF1997

In my case, I need to decide whether two points share an edge based on their spatial distance, and the point distribution is sometimes very scattered. In that case edge_index is null... which introduces the bug...
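
To illustrate that scenario, a sketch (assuming radius_graph from torch-cluster, which PyG exposes): with very scattered points and a small radius, the resulting edge_index comes out empty.

import torch
from torch_geometric.nn import radius_graph  # requires torch-cluster

# Edges from spatial distance; no pair lies within the radius here.
pos = torch.tensor([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
edge_index = radius_graph(pos, r=1.0)
print(edge_index.shape)  # torch.Size([2, 0]) -- an empty edge_index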

WMF1997 commented 5 years ago

hello @ZongweiZhou1, I have a small thought about the null edge_index: what about adding self-loops ([[0,1,2],[0,1,2]] for 3 points)? The distance from a point to itself must be 0, so these edges are always valid, and edge_index is then never null.
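
A sketch of that workaround (assuming the add_self_loops signature of recent PyG 1.x releases, which returns a tuple):

import torch
from torch_geometric.utils import add_self_loops

x = torch.rand(3, 2)
edge_index = torch.tensor([[], []], dtype=torch.long)  # empty
edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
print(edge_index)
# tensor([[0, 1, 2],
#         [0, 1, 2]])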

yours sincerely: @WMF1997

P.S. I also tested on PyG 1.3.0 (from Windows, via remote control of the workstation, on which I re-installed PyG, i.e. I upgraded PyG and the other libraries).

[screenshot]

ZongweiZhou1 commented 5 years ago

hello @ZongweiZhou1, I have a small thought about the null edge_index: what about adding self-loops ([[0,1,2],[0,1,2]] for 3 points)? The distance from a point to itself must be 0, so these edges are always valid, and edge_index is then never null.

yours sincerely: @WMF1997

P.S. I also tested on PyG 1.3.0 (from Windows, via remote control of the workstation, on which I re-installed PyG, i.e. I upgraded PyG and the other libraries).

[screenshot]

Yeah, maybe that is a solution and I'll have a try. Thank you for your help~ @WMF1997

rusty1s commented 5 years ago

I fixed the bug on the torch-scatter master branch. In the end, this is not a PyTorch Geometric bug but rather a PyTorch bug, where view() seems to do stupid things when applied to zero-element tensors.

I do not think the bug is critical, so I will wait a while before cutting a new torch-scatter release. Thank you very much!
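
For illustration, a minimal sketch of the kind of guard that sidesteps the zero-element view() path (this is just the idea, not the actual torch-scatter patch):

import torch

def expand_index_safe(src, index, dim):
    # Illustration only: handle the zero-element case explicitly so the
    # index keeps src's device, instead of going through view()/expand_as(),
    # which mis-reported the device on PyTorch 1.1.0.
    if index.numel() == 0:
        return src.new_empty(src.size(), dtype=torch.long)
    index_size = [1] * src.dim()
    index_size[dim] = src.size(dim)
    return index.view(index_size).expand_as(src)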