Open Guptajakala opened 5 years ago
Can you update torch-scatter to its latest version and see if this fixes the issue?
@rusty1s Updating with pip install did not solve it. Is 1.3.1 the latest?
>>> import torch_scatter
>>> print(torch_scatter.__version__)
1.3.1
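A version check like the one above can be extended to all the packages involved. The following is only a minimal sketch using the standard library; the distribution names in `PACKAGES` are assumptions based on the PyPI names of the packages mentioned in this thread, and `installed_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust them if your install differs.
PACKAGES = ["torch", "torch-scatter", "torch-geometric"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into a bug report makes it easier to spot a mismatch between the installed packages.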
Yes, it is the latest. Hmm, this is weird. I tested with different batch sizes without any problems. Please do me a favor and run your code with CUDA_LAUNCH_BLOCKING=1 and report back the error.
@rusty1s Running the code with CUDA_LAUNCH_BLOCKING=1 gives the same output.
File "modelnet_test.py", line 178, in <module>
train()
File "modelnet_test.py", line 142, in train
out = model(data)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 111, in forward
sa1_out = self.sa1_module(*sa0_out)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 55, in forward
x = self.conv(x, (pos, pos[idx]), edge_index)
File "/home/gupta/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/point_conv.py", line 66, in forward
return self.propagate(edge_index, x=x, pos=pos)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 121, in propagate
out = scatter_(self.aggr, out, edge_index[i], dim_size=size[i])
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/utils/scatter.py", line 33, in scatter_
out[out == fill_value] = 0
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh:82
Since I cannot reproduce this, I need your help with fixing this using print debugging :)
Can you modify utils/scatter.py for me? Please add
print(src.size(), index.size(), index.min(), index.max())
before the op call in line 28. And please try to comment out lines 32-33 and/or replace them with
out.masked_fill_(out == fill_value, 0)
Thank you.
Sure, I have changed it to this. I cannot find any setup.py. Could you tell me how to install it?
assert name in ['add', 'mean', 'max']
op = getattr(torch_scatter, 'scatter_{}'.format(name))
fill_value = -1e9 if name == 'max' else 0
print(src.size(), index.size(), index.min(), index.max())
out = op(src, index, 0, None, dim_size, fill_value)
if isinstance(out, tuple):
    out = out[0]
# if name == 'max':
#     out[out == fill_value] = 0
return out
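To make the role of `fill_value` in the snippet above concrete, here is a pure-Python reference sketch of a max-scatter over a flat list. It is an illustration only, not the torch_scatter implementation: `scatter_max_reference` is a hypothetical name, and the real op works on tensors along a dimension rather than on lists.

```python
def scatter_max_reference(src, index, dim_size, fill_value=-1e9):
    """Pure-Python reference: out[i] = max of src[j] over all j with index[j] == i."""
    out = [fill_value] * dim_size
    for value, target in zip(src, index):
        if not 0 <= target < dim_size:
            raise IndexError(
                f"index {target} out of range for dim_size {dim_size}")
        out[target] = max(out[target], value)
    # Slots that received no value still hold the sentinel. Resetting them
    # to 0 mirrors the commented-out `out[out == fill_value] = 0` line.
    return [0 if v == fill_value else v for v in out]


if __name__ == "__main__":
    print(scatter_max_reference([1.0, 5.0, -2.0], [0, 0, 2], dim_size=4))
```

In this example slots 1 and 3 receive no value, so they are the ones reset to 0. With the reset commented out, as in the modified snippet, such slots would keep the sentinel `-1e9`.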
There is a setup.py in the root directory. Running python setup.py develop should work :)
Oh, I used pip to install pytorch_geometric, so it was installed in ~/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric, and there is no setup.py there.
Ah yes, you need to clone from GitHub, sorry :(
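When a package may exist both in site-packages and in a cloned checkout, it helps to confirm which copy Python actually imports before editing files. A small standard-library sketch follows; `package_location` is a hypothetical helper name, and it only handles top-level module names.

```python
import importlib.util


def package_location(module_name):
    """Return the file a top-level module would be imported from, or None if it is not importable."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    for name in ("torch_geometric", "torch_scatter"):
        print(name, "->", package_location(name) or "not found")
```

If the printed path points into site-packages, edits made there take effect on the next run without reinstalling.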
I found I don't need to reinstall; running my program directly already reflects the change:
torch.Size([1039749, 128]) torch.Size([1039749]) tensor(0, device='cuda:0') tensor(16350, device='cuda:0')
Traceback (most recent call last):
File "modelnet_test.py", line 178, in <module>
train()
File "modelnet_test.py", line 142, in train
out = model(data)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 111, in forward
sa1_out = self.sa1_module(*sa0_out)
File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 56, in forward
pos, batch = pos[idx], batch[idx]
RuntimeError: CUDA error: invalid argument
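As a rough sanity check on the debug line printed before the traceback above, the tensor sizes can be worked out by hand. This sketch only sizes the tensor and does not identify the cause of the error. It assumes float32 features, `max_num_neighbors=64` as in the script posted later in this thread, and target indices numbered from 0 without gaps; none of these is confirmed by the debug output itself.

```python
# Values copied from the debug print above.
num_edges, num_features = 1039749, 128
max_target_index = 16350

# Assumptions (not confirmed by the debug output): float32 features and
# max_num_neighbors=64 as in the script posted later in this thread.
bytes_per_float32 = 4
max_num_neighbors = 64

numel = num_edges * num_features
size_mib = numel * bytes_per_float32 / 2**20
num_targets = max_target_index + 1  # assumes targets are numbered 0..max without gaps
edge_upper_bound = num_targets * max_num_neighbors

print(f"message tensor elements: {numel:,}")
print(f"message tensor size:     {size_mib:.1f} MiB")
print(f"edges: {num_edges:,} of at most {edge_upper_bound:,}")
```

Under these assumptions the message tensor holds about 133 million elements (roughly 508 MiB), and the edge count stays just under the bound implied by 64 neighbors per sampled point.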
Excellent error messages :D I am totally unsure what may cause this issue :(
@rusty1s Hey, I found that running with CUDA_LAUNCH_BLOCKING=1 still uses my GPU. I later hard-coded the device to be CPU and the error disappears. What does this indicate?
Another clue: on the GPU, any batch size below 16 works fine, although at batch size 16 my GPU memory is far from fully occupied. How can this be explained?
CUDA_LAUNCH_BLOCKING disables asynchronous GPU execution, so it will still use the GPU. It is generally quite useful for tracking down errors in your code.
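One way to make sure the variable is in place before any CUDA work starts is to launch the script as a child process with the variable already in its environment. The following is a minimal sketch; `run_with_blocking_launches` is a hypothetical helper, and the probe command is a stand-in for a real training script.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(argv):
    """Run a Python command line with CUDA_LAUNCH_BLOCKING=1 in its environment."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run([sys.executable, *argv], env=env,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable; replace the
    # arguments with your own script, e.g. ["modelnet_test.py"].
    probe = ["-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
    result = run_with_blocking_launches(probe)
    print(result.stdout.strip())
```

With synchronous launches, the Python traceback is more likely to point at the operation that actually failed rather than at a later, unrelated line.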
Did you make any other modifications to the example? Which category do you test on? Are you using an older version of pointnet2_segmentation.py and pointnet2_classification.py? Does the classification example work?
Since segmentation uses some functions from classification, I copied the dependent parts into the segmentation file. The files are from version tag 1.3.0. The dataset is the ShapeNet Airplane category. I'm curious whether you would get the same error if you ran this piece of code. My reported error "RuntimeError: CUDA error: invalid argument" does not look related to GPU memory. This code runs fine with a batch size below 16.
import os.path as osp
import torch
import torch.nn.functional as F
from torch_geometric.datasets import ShapeNet
import torch_geometric.transforms as T
from torch_geometric.data import DataLoader
from torch_geometric.nn import knn_interpolate
from torch.nn import Sequential as Seq, Linear as Lin, ReLU, BatchNorm1d as BN
from torch_geometric.utils import mean_iou
from torch_geometric.nn import PointConv, fps, radius
from torch_geometric.utils import scatter_
category = 'Airplane'
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'ShapeNet')
transform = T.Compose([
    T.RandomTranslate(0.01),
    T.RandomRotate(15, axis=0),
    T.RandomRotate(15, axis=1),
    T.RandomRotate(15, axis=2)
])
pre_transform = T.NormalizeScale()
train_dataset = ShapeNet(
    path,
    category,
    train=True,
    transform=transform,
    pre_transform=pre_transform)
test_dataset = ShapeNet(
    path, category, train=False, pre_transform=pre_transform)
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, num_workers=14)
test_loader = DataLoader(
    test_dataset, batch_size=32, shuffle=False, num_workers=14)
n_iter = 0


class SAModule(torch.nn.Module):
    def __init__(self, ratio, r, nn):
        super(SAModule, self).__init__()
        self.ratio = ratio
        self.r = r
        self.conv = PointConv(nn)

    def forward(self, x, pos, batch):
        idx = fps(pos, batch, ratio=self.ratio)
        row, col = radius(
            pos, pos[idx], self.r, batch, batch[idx], max_num_neighbors=64)
        edge_index = torch.stack([col, row], dim=0)
        x = self.conv(x, (pos, pos[idx]), edge_index)
        # print('idx=',idx)
        # print('idx type', idx.shape, idx.type())
        # print('pos, batch', pos.shape, batch.shape, pos.type(), batch.type())
        pos, batch = pos[idx], batch[idx]
        return x, pos, batch


class GlobalSAModule(torch.nn.Module):
    def __init__(self, nn):
        super(GlobalSAModule, self).__init__()
        self.nn = nn

    def forward(self, x, pos, batch):
        x = self.nn(torch.cat([x, pos], dim=1))
        x = scatter_('max', x, batch)
        pos = pos.new_zeros((x.size(0), 3))
        batch = torch.arange(x.size(0), device=batch.device)
        return x, pos, batch


def MLP(channels, batch_norm=True):
    return Seq(*[
        Seq(Lin(channels[i - 1], channels[i]), ReLU(), BN(channels[i]))
        for i in range(1, len(channels))
    ])


class FPModule(torch.nn.Module):
    def __init__(self, k, nn):
        super(FPModule, self).__init__()
        self.k = k
        self.nn = nn

    def forward(self, x, pos, batch, x_skip, pos_skip, batch_skip):
        x = knn_interpolate(x, pos, pos_skip, batch, batch_skip, k=self.k)
        if x_skip is not None:
            x = torch.cat([x, x_skip], dim=1)
        x = self.nn(x)
        return x, pos_skip, batch_skip


class Net(torch.nn.Module):
    def __init__(self, num_classes):
        super(Net, self).__init__()
        self.sa1_module = SAModule(0.2, 0.2, MLP([3, 64, 64, 128]))
        self.sa2_module = SAModule(0.25, 0.4, MLP([128 + 3, 128, 128, 256]))
        self.sa3_module = GlobalSAModule(MLP([256 + 3, 256, 512, 1024]))
        self.fp3_module = FPModule(1, MLP([1024 + 256, 256, 256]))
        self.fp2_module = FPModule(3, MLP([256 + 128, 256, 128]))
        self.fp1_module = FPModule(3, MLP([128, 128, 128, 128]))
        self.lin1 = torch.nn.Linear(128, 128)
        self.lin2 = torch.nn.Linear(128, 128)
        self.lin3 = torch.nn.Linear(128, num_classes)

    def forward(self, data):
        sa0_out = (data.x, data.pos, data.batch)
        sa1_out = self.sa1_module(*sa0_out)
        sa2_out = self.sa2_module(*sa1_out)
        sa3_out = self.sa3_module(*sa2_out)
        fp3_out = self.fp3_module(*sa3_out, *sa2_out)
        fp2_out = self.fp2_module(*fp3_out, *sa1_out)
        x, _, _ = self.fp1_module(*fp2_out, *sa0_out)
        x = F.relu(self.lin1(x))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin3(x)
        return F.log_softmax(x, dim=-1)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = torch.device('cpu')
model = Net(train_dataset.num_classes).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


def train():
    global n_iter
    model.train()
    total_loss = correct_nodes = total_nodes = 0
    for i, data in enumerate(train_loader):
        n_iter += 1
        data = data.to(device)
        optimizer.zero_grad()
        out = model(data)
        loss = F.nll_loss(out, data.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        correct_nodes += out.max(dim=1)[1].eq(data.y).sum().item()
        total_nodes += data.num_nodes
        if (i + 1) % 10 == 0:
            print('[{}/{}] Loss: {:.4f}, Train Accuracy: {:.4f}'.format(
                i + 1, len(train_loader), total_loss / 10,
                correct_nodes / total_nodes))
            total_loss = correct_nodes = total_nodes = 0


def test(loader):
    global n_iter
    model.eval()
    correct_nodes = total_nodes = 0
    ious = []
    for data in loader:
        data = data.to(device)
        with torch.no_grad():
            out = model(data)
        pred = out.max(dim=1)[1]
        correct_nodes += pred.eq(data.y).sum().item()
        ious += [mean_iou(pred, data.y, test_dataset.num_classes, data.batch)]
        total_nodes += data.num_nodes
    return correct_nodes / total_nodes, torch.cat(ious, dim=0).mean().item()


for epoch in range(1, 31):
    train()
    acc, iou = test(test_loader)
    print('Epoch: {:02d}, Acc: {:.4f}, IoU: {:.4f}'.format(epoch, acc, iou))
Works just fine for me. I wonder what happens if you replace the x = scatter_('max', x, batch) call with x = scatter_('mean', x, batch).
@rusty1s I just had a chance to test this, and the error is still:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh line=82 error=11 : invalid argument
Traceback (most recent call last):
File "modelnet_test.py", line 185, in <module>
train()
File "modelnet_test.py", line 149, in train
out = model(data)
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 117, in forward
sa1_out = self.sa1_module(*sa0_out)
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "modelnet_test.py", line 56, in forward
x = self.conv(x, (pos, pos[idx]), edge_index)
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/nn/conv/point_conv.py", line 66, in forward
return self.propagate(edge_index, x=x, pos=pos)
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 121, in propagate
out = scatter_(self.aggr, out, edge_index[i], dim_size=size[i])
File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/utils/scatter.py", line 33, in scatter_
out[out == fill_value] = 0
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh:82
Ok, maybe it's related to this issue. Can you please try to install torch-geometric from master? This issue may already be fixed there.
When I change the batch size in this [line](https://github.com/rusty1s/pytorch_geometric/blob/80341478210305809576923597af11cd1ed36eeb/examples/pointnet2_segmentation.py#L31) to any other number such as 32, it runs into a bug.
But the original setting of batch size 12 works fine. This is kind of weird, since I checked the implementation and did not find anything else that looks relevant to batch size.
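To pin down where a batch-size-dependent failure starts, one option is to try increasing batch sizes and stop at the first one that raises. This is only a sketch under assumptions: `first_failing_batch_size` and `fake_forward_pass` are hypothetical names, and in practice `run_once` would build a loader with the given batch size and push one batch through the model.

```python
def first_failing_batch_size(run_once, batch_sizes):
    """Return (batch_size, error) for the first size where run_once raises RuntimeError, else None."""
    for batch_size in batch_sizes:
        try:
            run_once(batch_size)
        except RuntimeError as error:
            # Stop at the first failure: after a CUDA error the process
            # may no longer be in a usable state for further attempts.
            return batch_size, error
    return None


if __name__ == "__main__":
    # Hypothetical stand-in that mimics the behaviour reported in this
    # thread (sizes below 16 work); it does not touch the GPU at all.
    def fake_forward_pass(batch_size):
        if batch_size >= 16:
            raise RuntimeError("CUDA error: invalid argument")

    print(first_failing_batch_size(fake_forward_pass, range(12, 33)))
```

Reporting the exact threshold together with the tensor sizes at that point gives a maintainer a much smaller search space than "32 fails, 12 works".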