pointnet++ bug when change batch size

Guptajakala commented 5 years ago

When I change the batch size on this line (https://github.com/rusty1s/pytorch_geometric/blob/80341478210305809576923597af11cd1ed36eeb/examples/pointnet2_segmentation.py#L31) to an arbitrary value such as 32, the script fails with the following error:

Traceback (most recent call last):
  File "modelnet_test.py", line 178, in <module>
    train()
  File "modelnet_test.py", line 142, in train
    out = model(data)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 111, in forward
    sa1_out = self.sa1_module(*sa0_out)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 55, in forward
    x = self.conv(x, (pos, pos[idx]), edge_index)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/point_conv.py", line 66, in forward
    return self.propagate(edge_index, x=x, pos=pos)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 121, in propagate
    out = scatter_(self.aggr, out, edge_index[i], dim_size=size[i])
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/utils/scatter.py", line 33, in scatter_
    out[out == fill_value] = 0
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh:82

With the original setting, batch size 12, everything works fine. This is kind of weird, since I checked the implementation and did not find anything else that looks related to the batch size.

rusty1s commented 5 years ago

Can you update torch-scatter to its latest version and see if this fixes the issue?

Guptajakala commented 5 years ago

@rusty1s Updating with pip did not solve it. Is 1.3.1 the latest version?

>>> import torch_scatter
>>> print(torch_scatter.__version__)
1.3.1

rusty1s commented 5 years ago

Yes, it is the latest. Mh, this is weird. I tested with different batch sizes without any problems. Please do me a favor and run your code with CUDA_LAUNCH_BLOCKING=1 and report back the error.
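
For reference, a minimal sketch of one way to turn this on from Python instead of the shell. CUDA_LAUNCH_BLOCKING is the variable named above and modelnet_test.py is the script name from the traceback; the helper names `blocking_env` and `run_blocking`, and the choice to launch a child process, are only illustrative assumptions.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # The variable generally needs to be in place before CUDA is initialised,
    # so it is handed to a fresh process rather than set half-way through a run.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script):
    """Run `script` in a child interpreter with CUDA_LAUNCH_BLOCKING=1."""
    return subprocess.run([sys.executable, script], env=blocking_env())


# Example (not executed here): run_blocking("modelnet_test.py")
```

From a shell, prefixing the command (CUDA_LAUNCH_BLOCKING=1 python modelnet_test.py) should have the same effect.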

Guptajakala commented 5 years ago

@rusty1s Running the code with CUDA_LAUNCH_BLOCKING=1 gives the same output.

  File "modelnet_test.py", line 178, in <module>
    train()
  File "modelnet_test.py", line 142, in train
    out = model(data)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 111, in forward
    sa1_out = self.sa1_module(*sa0_out)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 55, in forward
    x = self.conv(x, (pos, pos[idx]), edge_index)
  File "/home/gupta/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/point_conv.py", line 66, in forward
    return self.propagate(edge_index, x=x, pos=pos)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 121, in propagate
    out = scatter_(self.aggr, out, edge_index[i], dim_size=size[i])
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric/utils/scatter.py", line 33, in scatter_
    out[out == fill_value] = 0
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh:82

rusty1s commented 5 years ago

Since I cannot reproduce this, I need your help with fixing this using print debugging :) Can you modify utils/scatter.py for me? Please do:

print(src.size(), index.size(), index.min(), index.max())

before the op call in line 28. Please also try commenting out lines 32-33 and/or replacing them with out = out.masked_fill(out == fill_value, 0). Thank you.

Guptajakala commented 5 years ago

Sure, I have changed it to this. I cannot find a setup.py, though. Could you tell me how to install it?

    assert name in ['add', 'mean', 'max']

    op = getattr(torch_scatter, 'scatter_{}'.format(name))
    fill_value = -1e9 if name == 'max' else 0

    print(src.size(), index.size(), index.min(), index.max())
    out = op(src, index, 0, None, dim_size, fill_value)
    if isinstance(out, tuple):
        out = out[0]

    # if name == 'max':
    #     out[out == fill_value] = 0

    return out
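
To make the role of fill_value and the commented-out clean-up step easier to follow, here is a pure-Python sketch of what a 1-D scatter-max with a sentinel does. It is not torch_scatter's implementation; the name `scatter_max_sketch` and the use of plain lists are assumptions made purely for illustration.

```python
def scatter_max_sketch(src, index, dim_size, fill_value=-1e9):
    """Pure-Python stand-in for a 1-D scatter-max followed by the clean-up step."""
    # Every output slot starts at the sentinel fill value.
    out = [fill_value] * dim_size
    for value, target in zip(src, index):
        # Keep the largest value sent to each target slot.
        if value > out[target]:
            out[target] = value
    # Slots that never received a value still hold the sentinel; reset them
    # to 0, which is what `out[out == fill_value] = 0` does on a tensor.
    return [0 if v == fill_value else v for v in out]


print(scatter_max_sketch([1.0, 5.0, 2.0], [0, 0, 2], dim_size=4))
```

With the clean-up commented out as above, slots that receive no input would keep the value -1e9 instead of 0.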

rusty1s commented 5 years ago

There is a setup.py in the root directory. Running python setup.py develop should work :)

Guptajakala commented 5 years ago

Oh, I used pip to install pytorch_geometric, so it was installed in ~/anaconda3/envs/gupta/lib/python3.6/site-packages/torch_geometric, and there is no setup.py there.
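
A small sketch for checking which copy of a package Python actually imports, and therefore which scatter.py an edit has to go into. It only uses the standard library; the helper name `package_location` is hypothetical, and "json" stands in for "torch_geometric" so the snippet runs anywhere.

```python
import importlib.util


def package_location(name):
    """Return the file that `import name` would load, or None if not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# "json" is used only because it is always available; in this thread the
# interesting name would be "torch_geometric".
print(package_location("json"))
```

If the printed path points into site-packages, edits made there take effect on the next run without reinstalling.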

rusty1s commented 5 years ago

Ah yes, you need to clone from GitHub, sorry :(

Guptajakala commented 5 years ago

I found that I don't need to reinstall; running my program directly already reflects the change:

torch.Size([1039749, 128]) torch.Size([1039749]) tensor(0, device='cuda:0') tensor(16350, device='cuda:0')
Traceback (most recent call last):
  File "modelnet_test.py", line 178, in <module>
    train()
  File "modelnet_test.py", line 142, in train
    out = model(data)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 111, in forward
    sa1_out = self.sa1_module(*sa0_out)
  File "/home/gupta/anaconda3/envs/gupta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 56, in forward
    pos, batch = pos[idx], batch[idx]
RuntimeError: CUDA error: invalid argument
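
A quick arithmetic check on the printed shapes, as a sketch: the numbers are copied from the debug line above, 64 is the max_num_neighbors passed to radius(...) in the script, and treating every index from 0 to index.max() as a sampled centre point is an assumption.

```python
# Values copied from the debug print above.
num_edges = 1039749          # src.size(0) == index.size(0)
max_index = 16350            # index.max()
max_num_neighbors = 64       # from the radius(...) call in SAModule

# Assumption: every index from 0 to max_index is a sampled centre point.
num_centres = max_index + 1
edges_per_centre = num_edges / num_centres

print(num_centres, round(edges_per_centre, 2))
```

Under that assumption the average number of edges per centre stays below the neighbor cap, so the index tensor itself looks plausible rather than out of range.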

rusty1s commented 5 years ago

Excellent error messages :D I am totally unsure what may cause this issue :(

Guptajakala commented 5 years ago

@rusty1s Hey, I found that running with CUDA_LAUNCH_BLOCKING=1 still uses my GPU. I later hard-coded the device to be the CPU and the error disappeared. What does this tell us?

Guptajakala commented 5 years ago

Another clue: when using the GPU, any batch size below 16 works fine, even though at batch size 16 my GPU memory is far from fully occupied. How can this be explained?

rusty1s commented 5 years ago

CUDA_LAUNCH_BLOCKING disables asynchronous GPU execution, so it will still use the GPU. It is generally quite useful for tracking down errors in your code.

Did you make any other modifications to the example? On which category do you test? Are you using an older version of pointnet2_segmentation.py and pointnet2_classification.py? Does the classification example work?

Guptajakala commented 5 years ago

Since the segmentation example uses some functions from the classification example, I copied the parts it depends on into the segmentation file. The files are from version tag 1.3.0. The dataset is ShapeNet, category Airplane. I'm curious: if you run this piece of code, do you get the same error? The error I reported, "RuntimeError: CUDA error: invalid argument", does not look related to GPU memory. This code runs fine with batch sizes below 16.

import os.path as osp
import torch
import torch.nn.functional as F
from torch_geometric.datasets import ShapeNet
import torch_geometric.transforms as T
from torch_geometric.data import DataLoader
from torch_geometric.nn import knn_interpolate
from torch.nn import Sequential as Seq, Linear as Lin, ReLU, BatchNorm1d as BN
from torch_geometric.utils import mean_iou
from torch_geometric.nn import PointConv, fps, radius
from torch_geometric.utils import scatter_

category = 'Airplane'
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'ShapeNet')
transform = T.Compose([
    T.RandomTranslate(0.01),
    T.RandomRotate(15, axis=0),
    T.RandomRotate(15, axis=1),
    T.RandomRotate(15, axis=2)
])
pre_transform = T.NormalizeScale()
train_dataset = ShapeNet(
    path,
    category,
    train=True,
    transform=transform,
    pre_transform=pre_transform)
test_dataset = ShapeNet(
    path, category, train=False, pre_transform=pre_transform)
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, num_workers=14)
test_loader = DataLoader(
    test_dataset, batch_size=32, shuffle=False, num_workers=14)

n_iter=0

class SAModule(torch.nn.Module):
    def __init__(self, ratio, r, nn):
        super(SAModule, self).__init__()
        self.ratio = ratio
        self.r = r
        self.conv = PointConv(nn)

    def forward(self, x, pos, batch):
        idx = fps(pos, batch, ratio=self.ratio)

        row, col = radius(
            pos, pos[idx], self.r, batch, batch[idx], max_num_neighbors=64)
        edge_index = torch.stack([col, row], dim=0)
        x = self.conv(x, (pos, pos[idx]), edge_index)

        # print('idx=',idx)
        # print('idx type', idx.shape, idx.type())
        # print('pos, batch', pos.shape, batch.shape, pos.type(), batch.type())

        pos, batch = pos[idx], batch[idx]
        return x, pos, batch

class GlobalSAModule(torch.nn.Module):
    def __init__(self, nn):
        super(GlobalSAModule, self).__init__()
        self.nn = nn

    def forward(self, x, pos, batch):
        x = self.nn(torch.cat([x, pos], dim=1))
        x = scatter_('max', x, batch)
        pos = pos.new_zeros((x.size(0), 3))
        batch = torch.arange(x.size(0), device=batch.device)
        return x, pos, batch

def MLP(channels, batch_norm=True):
    return Seq(*[
        Seq(Lin(channels[i - 1], channels[i]), ReLU(), BN(channels[i]))
        for i in range(1, len(channels))
    ])

class FPModule(torch.nn.Module):
    def __init__(self, k, nn):
        super(FPModule, self).__init__()
        self.k = k
        self.nn = nn

    def forward(self, x, pos, batch, x_skip, pos_skip, batch_skip):
        x = knn_interpolate(x, pos, pos_skip, batch, batch_skip, k=self.k)
        if x_skip is not None:
            x = torch.cat([x, x_skip], dim=1)
        x = self.nn(x)
        return x, pos_skip, batch_skip

class Net(torch.nn.Module):
    def __init__(self, num_classes):
        super(Net, self).__init__()
        self.sa1_module = SAModule(0.2, 0.2, MLP([3, 64, 64, 128]))
        self.sa2_module = SAModule(0.25, 0.4, MLP([128 + 3, 128, 128, 256]))
        self.sa3_module = GlobalSAModule(MLP([256 + 3, 256, 512, 1024]))

        self.fp3_module = FPModule(1, MLP([1024 + 256, 256, 256]))
        self.fp2_module = FPModule(3, MLP([256 + 128, 256, 128]))
        self.fp1_module = FPModule(3, MLP([128, 128, 128, 128]))

        self.lin1 = torch.nn.Linear(128, 128)
        self.lin2 = torch.nn.Linear(128, 128)
        self.lin3 = torch.nn.Linear(128, num_classes)

    def forward(self, data):
        sa0_out = (data.x, data.pos, data.batch)
        sa1_out = self.sa1_module(*sa0_out)
        sa2_out = self.sa2_module(*sa1_out)
        sa3_out = self.sa3_module(*sa2_out)

        fp3_out = self.fp3_module(*sa3_out, *sa2_out)
        fp2_out = self.fp2_module(*fp3_out, *sa1_out)
        x, _, _ = self.fp1_module(*fp2_out, *sa0_out)

        x = F.relu(self.lin1(x))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin3(x)
        return F.log_softmax(x, dim=-1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = torch.device('cpu')
model = Net(train_dataset.num_classes).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train():
    global n_iter

    model.train()

    total_loss = correct_nodes = total_nodes = 0
    for i, data in enumerate(train_loader):
        n_iter+=1
        data = data.to(device)
        optimizer.zero_grad()
        out = model(data)
        loss = F.nll_loss(out, data.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        correct_nodes += out.max(dim=1)[1].eq(data.y).sum().item()
        total_nodes += data.num_nodes

        if (i + 1) % 10 == 0:
            print('[{}/{}] Loss: {:.4f}, Train Accuracy: {:.4f}'.format(
                i + 1, len(train_loader), total_loss / 10,
                correct_nodes / total_nodes))
            total_loss = correct_nodes = total_nodes = 0

def test(loader):
    global n_iter
    model.eval()

    correct_nodes = total_nodes = 0
    ious = []
    for data in loader:
        data = data.to(device)
        with torch.no_grad():
            out = model(data)
        pred = out.max(dim=1)[1]
        correct_nodes += pred.eq(data.y).sum().item()
        ious += [mean_iou(pred, data.y, test_dataset.num_classes, data.batch)]
        total_nodes += data.num_nodes
    return correct_nodes / total_nodes, torch.cat(ious, dim=0).mean().item()

for epoch in range(1, 31):
    train()
    acc, iou = test(test_loader)
    print('Epoch: {:02d}, Acc: {:.4f}, IoU: {:.4f}'.format(epoch, acc, iou))

rusty1s commented 5 years ago

Works just fine for me. I wonder what happens if you replace the x = scatter_('max', x, batch) call with x = scatter_('mean', x, batch).
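
As far as the scatter_ code posted earlier shows, 'mean' uses a fill value of 0 and skips the `if name == 'max'` branch, so the `out == fill_value` comparison never runs for that call. A pure-Python sketch of a 1-D scatter-mean, for illustration only (the name `scatter_mean_sketch` is an assumption, and this is not torch_scatter's implementation):

```python
def scatter_mean_sketch(src, index, dim_size):
    """Pure-Python stand-in for a 1-D scatter-mean."""
    sums = [0.0] * dim_size
    counts = [0] * dim_size
    for value, target in zip(src, index):
        # Accumulate the sum and the number of contributions per target slot.
        sums[target] += value
        counts[target] += 1
    # Empty slots keep the fill value 0, so no sentinel comparison is needed.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


print(scatter_mean_sketch([1.0, 5.0, 2.0], [0, 0, 2], dim_size=4))
```

Note that this swap only affects the scatter_ call in GlobalSAModule; the call inside PointConv from the traceback uses the layer's own aggregation setting.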

Guptajakala commented 5 years ago

@rusty1s
I just had a chance to test this; the error is still the same:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh line=82 error=11 : invalid argument
Traceback (most recent call last):
  File "modelnet_test.py", line 185, in <module>
    train()
  File "modelnet_test.py", line 149, in train
    out = model(data)
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 117, in forward
    sa1_out = self.sa1_module(*sa0_out)
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "modelnet_test.py", line 56, in forward
    x = self.conv(x, (pos, pos[idx]), edge_index)
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/nn/conv/point_conv.py", line 66, in forward
    return self.propagate(edge_index, x=x, pos=pos)
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 121, in propagate
    out = scatter_(self.aggr, out, edge_index[i], dim_size=size[i])
  File "/home/bowen/anaconda3/envs/bowen/lib/python3.6/site-packages/torch_geometric/utils/scatter.py", line 33, in scatter_
    out[out == fill_value] = 0
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCTensorMathCompare.cuh:82

rusty1s commented 5 years ago

Ok, maybe it's related to this issue. Can you please try to install torch-geometric from master? This issue may already be fixed there.

pyg-team / pytorch_geometric

pointnet++ bug when change batch size #502