Thanks for reporting. I will look into it, but it may take a while since I currently do not have access to multiple GPUs. I am wondering why the `view` call should have an effect on the device type of the tensor.
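(A minimal check one might run to probe exactly that question; this is a hypothetical sketch, assuming a machine with at least two CUDA devices:)

```python
import torch

# Does view() keep a zero-element tensor on its original device?
idx = torch.tensor([[], []], dtype=torch.long, device='cuda:1')
print(idx.device)           # cuda:1
print(idx.view(-1).device)  # one would expect cuda:1 here as well
```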
hello @ZongweiZhou1 @rusty1s

Conclusion: perhaps it is the MULTI-LAYER stacking (of EdgeConv) that causes it...

- MULTI-LAYER (of GCNConv): works
- MULTI-LAYER (of EdgeConv): fails
- CPU and single GPU: OK (tested, all right)

Below are my LONG experiments and LONG results...

I tried to re-implement the code provided by @ZongweiZhou1. Here is the code (since he does not mention what `x` is in his code, I add `x` here).

First, my environment:

- OS: Ubuntu 16.04
- Python: 3.6.4
- PyTorch: 1.0.1.post2
- PyG: 1.0.3

The multi-NVIDIA-GPU setup: a 3-GPU workstation.
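(As a quick sanity check of the visible devices, something like the following could be used; just an illustrative sketch:)

```python
import torch

print(torch.cuda.is_available())  # True
print(torch.cuda.device_count())  # 3 on this workstation
```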
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import EdgeConv
from torch_geometric.data import Data


class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)


class TestBlock(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock, self).__init__()
        self.edge_conv1 = EdgeConv(MLP(F_in, F_out), aggr='mean')
        self.edge_conv2 = EdgeConv(MLP(F_out, F_out), aggr='mean')
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index
        x = self.edge_conv1(x, edge_index)
        x = self.fc(self.edge_conv2(x, edge_index))
        return x


d_cpu = torch.device('cpu')
d_cu1 = torch.device('cuda:1')
d_cu2 = torch.device('cuda:2')

block = TestBlock(4, 8).to(d_cu1)
x = torch.randn(50, 4)  # F_in = 4, to match TestBlock(4, 8)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print(y)
print(y.shape)
```
Here are the results:

1. Run via ipython (not as a python program): all right.
2. Run via python (i.e. `python pyg_test.py`): it still works fine here.

What if I change the GPU? In the code above I also defined `d_cu2`. (Sorry, not `d_cu0`: the device `cuda:0` is being used by another program which I cannot stop.) On `cuda:2` it also works, very nice.

And of course, on a 3-GPU workstation, the data and model cannot be sent to a 4th CUDA device.

Sorry, I have to attend a group meeting at 09:30 (UTC+8); further experiments on this issue will be run over remote SSH.
Oops, the same problem occurs with:

1. PyTorch 1.1.0
2. PyG 1.3.0

and all other libraries at their NEWEST versions.

NEW things: NOTE (WMF): this time, I use MobaXTerm to connect to that server.

Ablation: I know that modifying either the `.py` or the `.cu` files may take a relatively long time. Before doing that, I think it is worth finding out what causes the problem (for example, whether it is a problem of PyTorch rather than PyG, ...).
I will show the code below.

First, I commented out the MLP part (to show the behavior without extra nn.Module nesting, only PyG layers).

Second: sorry, I noticed that the second assignment cascades the fc and the edge conv, i.e.

`x = self.fc(self.edge_conv2(x, edge_index))`

If I use only `self.fc`, without `self.edge_conv2`, it works.

Third (perhaps finally): after trying all kinds of code permutations (styles), the last step was to split the nested call

`self.fc(self.edge_conv2(...))`

into separate statements:

`x = self.edge_conv2(x, edge_index)` followed by `x = self.fc(x)`

My final code:
```python
# pyg_test_device.py
# Other nn or other devices (for example, JUST cuda:0).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import EdgeConv
from torch_geometric.data import Data

# devices
d_cpu = torch.device('cpu')
d_cu0 = torch.device('cuda:0')
d_cu1 = torch.device('cuda:1')
d_cu2 = torch.device('cuda:2')
d_cu3 = torch.device('cuda:3')


class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)


class TestBlock(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock, self).__init__()
        self.mlp1 = MLP(F_in, F_out)
        self.edge_conv1 = EdgeConv(self.mlp1, aggr='mean')
        # keep the second EdgeConv commented out, and it works:
        # self.mlp2 = MLP(F_out, F_out)
        # self.edge_conv2 = EdgeConv(self.mlp2, aggr='mean')
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index
        x = self.edge_conv1(x, edge_index)
        # x = self.edge_conv2(x, edge_index)
        x = self.fc(x)
        return x


block = TestBlock(4, 22).to(d_cu1)
x = torch.randn(50, 4)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print(y)
print(y.shape)
```
Finally... the same result as in 1.

Note that the constructor of EdgeConv differs from that of GCNConv (and of many other GNN layers): EdgeConv needs a user-defined nn module passed in. Perhaps that is related?
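To illustrate the difference in constructor style (a minimal sketch; the channel sizes are arbitrary):

```python
import torch.nn as nn
from torch_geometric.nn import EdgeConv, GCNConv

# GCNConv builds its own weights from the channel sizes:
conv_gcn = GCNConv(4, 8)

# EdgeConv instead wraps a user-defined nn.Module; its input size must be
# 2 * in_channels, because EdgeConv feeds it the concatenated features of
# both edge endpoints:
conv_edge = EdgeConv(nn.Linear(2 * 4, 8), aggr='mean')
```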
The code (with GCNConv added):
```python
# pyg_test_device.py
# Other nn or other devices (for example, JUST cuda:0).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import EdgeConv, GCNConv
from torch_geometric.data import Data

# devices
d_cpu = torch.device('cpu')
d_cu0 = torch.device('cuda:0')
d_cu1 = torch.device('cuda:1')
d_cu2 = torch.device('cuda:2')
d_cu3 = torch.device('cuda:3')


class MLP(nn.Module):
    def __init__(self, F_in, F_out):
        super(MLP, self).__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * F_in, F_out),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.mlp(x)


class TestBlock_edge(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock_edge, self).__init__()
        self.mlp1 = MLP(F_in, F_out)
        self.edge_conv1 = EdgeConv(self.mlp1, aggr='mean')
        # keep the second EdgeConv commented out, and it works:
        # self.mlp2 = MLP(F_out, F_out)
        # self.edge_conv2 = EdgeConv(self.mlp2, aggr='mean')
        self.fc = nn.Linear(F_out, 16)

    def forward(self, data):
        x = data.x
        edge_index = data.edge_index
        x = self.edge_conv1(x, edge_index)
        # x = self.edge_conv2(x, edge_index)
        x = self.fc(x)
        return x


class TestBlock_gcn(nn.Module):
    def __init__(self, F_in, F_out):
        super(TestBlock_gcn, self).__init__()
        self.gcn1 = GCNConv(F_in, F_out)
        self.gcn2 = GCNConv(F_out, F_out)
        self.fc = nn.Linear(F_out, 30)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.gcn1(x, edge_index)
        x = self.gcn2(x, edge_index)
        x = self.fc(x)
        return x


block = TestBlock_gcn(4, 22).to(d_cu1)
x = torch.randn(50, 4)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d_cu1))
print(y)
print(y.shape)
```
GCNConv: all right.

Sorry, this reply got too long. I think it may help you... a little.

I will look into `edge_conv` and try to hack together a way of using MULTI-LAYER EdgeConv on multiple GPUs (`cuda:1`, ...).

yours sincerely: @WMF1997

P.S. @ZongweiZhou1 Most of the time, `edge_index` will not be `[[], []]` (i.e. it will not be empty) when testing with meaningful datasets or a random `edge_index`. Indeed, after adding something to the null `edge_index`, it works (no error, and the device shown is `cuda:1`).

yours sincerely: @WMF1997
@WMF1997 @rusty1s

Thanks so much for this long debugging process. :)

I also find that the problem is not with PyTorch. Instead, the device of the `.data` of an empty `edge_index` is changed after it is fed into the EdgeConv, when it is the first sample given to the EdgeConv. For example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import EdgeConv
from torch_geometric.data import Data

d1 = torch.device('cuda:1')
net = EdgeConv(nn.Linear(2 * 2, 4), aggr='mean').to(d1)

x = torch.rand(4, 2)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)

print('device of edge_index before feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data before feeding into the EdgeConv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('device of edge_index after feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data after feeding into the EdgeConv: {}'.format(data.edge_index.data.device))
```
The outputs are:
If the `edge_index` of the first sample sent to the EdgeConv is not empty, while that of the second sample is empty, the bug is gone:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import EdgeConv
from torch_geometric.data import Data

d1 = torch.device('cuda:1')
net = EdgeConv(nn.Linear(2 * 2, 4), aggr='mean').to(d1)

# first inference: non-empty edge_index
x = torch.rand(4, 2)
edge_index = torch.tensor([[0], [1]], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)
print('device of edge_index before feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data before feeding into the EdgeConv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('device of edge_index after feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data after feeding into the EdgeConv: {}'.format(data.edge_index.data.device))

# second inference: empty edge_index
x = torch.rand(4, 2)
edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index).to(d1)
print('\ndevice of edge_index before feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data before feeding into the EdgeConv: {}'.format(data.edge_index.data.device))
y = net(data.x, data.edge_index)
print('device of edge_index after feeding into the EdgeConv: {}'.format(data.edge_index.device))
print('device of edge_index.data after feeding into the EdgeConv: {}'.format(data.edge_index.data.device))
```
The output is:
B.T.W., is this a problem of the PyG version? My PyG version is 1.3.0. As you posted, it works fine with 1.0.3?
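(For reference, a quick way to double-check the installed versions; just a sketch:)

```python
import torch
import torch_geometric

print(torch.__version__)
print(torch_geometric.__version__)
```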
In my case, I need to decide whether there is an associated edge based on the spatial distance between points, and the point distribution is sometimes very scattered. In those cases `edge_index` is null, which triggers the bug.
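(For context, this kind of distance-based edge construction might look like the following; an illustrative sketch using `radius_graph` from torch_cluster, assuming it is installed, with an arbitrary radius value:)

```python
import torch
from torch_cluster import radius_graph

pos = torch.rand(50, 3)
# connect points that lie within radius r of each other; with a very
# scattered point cloud and a small r this can yield an empty edge_index:
edge_index = radius_graph(pos, r=0.05)
print(edge_index.shape)
```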
hello @ZongweiZhou1

I have a small thought about the null `edge_index`: what about adding self-loops (`[[0, 1, 2], [0, 1, 2]]` for 3 points)? The distance from a point to itself must be 0, so `edge_index` would then never be empty.
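A minimal sketch of that idea, using `add_self_loops` from `torch_geometric.utils` (the node count 3 is just an example):

```python
import torch
from torch_geometric.utils import add_self_loops

edge_index = torch.tensor([[], []], dtype=torch.long)
# add one self-loop per node, so edge_index is guaranteed to be non-empty:
edge_index, _ = add_self_loops(edge_index, num_nodes=3)
print(edge_index)  # tensor([[0, 1, 2], [0, 1, 2]])
```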
yours sincerely: @WMF1997
P.S. I also tested with PyG 1.3.0 (from a Windows machine, using remote control to the workstation on which I re-installed PyG, i.e. I upgraded PyG and the other libraries).
Yeah, maybe that is a solution and I'll give it a try. Thank you for your help~ @WMF1997
I fixed the bug on the torch-scatter master branch. In the end, this is not a PyTorch Geometric bug but rather a PyTorch bug, where `view()` seems to do strange things when applied to zero-element tensors. I do not think the bug is critical, so I will wait a while before cutting a new torch-scatter release. Thank you very much!
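(Until that release lands, a possible user-side workaround is a hedged sketch like the following, not the actual torch-scatter fix: bypass the convolution entirely when there are no edges. With `aggr='mean'`, isolated nodes should come out as all-zero rows anyway, so returning zeros directly avoids the buggy code path. `safe_conv` and `out_channels` are hypothetical names.)

```python
import torch

def safe_conv(conv, x, edge_index, out_channels):
    # Hypothetical guard: with an empty edge_index, skip the conv and
    # return the all-zero output that mean aggregation would produce.
    if edge_index.numel() == 0:
        return x.new_zeros(x.size(0), out_channels)
    return conv(x, edge_index)
```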
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
```python
def gen(src, index, dim=-1, out=None, dim_size=None, fill_value=0):
    dim = range(src.dim())[dim]  # Get real dim value.
```

```python
# NOTE: x, block, and d are not defined in this snippet
# (see the discussion above).
edge_index = torch.tensor([[0], [1]], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d))

edge_index = torch.tensor([[], []], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)
y = block(data.to(d))

print(y.device)
print(data.edge_index.data)
```

```
cuda:1
tensor([], device='cuda:1', size=(2, 0), dtype=torch.int64)
```