pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Failing to train Pointnet2 segmentation on custom dataset #2358

Closed (jediofgever closed this issue 3 years ago)

jediofgever commented 3 years ago

❓ Questions & Help

Hello,

I am quite new to point cloud learning. I have done some tutorials in pytorch_geometric, but I have now run into something I can't quite understand, so I would appreciate your help. I have large point cloud maps that I use for robot navigation. The maps are generated and labeled from simulations. I want to train networks to segment drivable and non-drivable regions. I created a Dataset for this purpose on my fork, named uneven_ground_dataset.py.

I also modified pointnet2_segmentation.py.

When I start training, I encounter the following problem:

ros2-foxy@ros2foxy-Lenovo-ideapad-700-15ISK:~/pytorch_geometric$ python3 examples/pointnet2_segmentation.py 
mm Intializing UnevenGroundDataset dataset
download function is void, makesure data is locally availabe and under provided root folder
Traceback (most recent call last):
  File "examples/pointnet2_segmentation.py", line 125, in <module>
    train()
  File "examples/pointnet2_segmentation.py", line 86, in train
    out = model(data)
  File "/home/ros2-foxy/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "examples/pointnet2_segmentation.py", line 58, in forward
    sa1_out = self.sa1_module(*sa0_out)
  File "/home/ros2-foxy/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ros2-foxy/pytorch_geometric/examples/pointnet2_classification.py", line 21, in forward
    row, col = radius(pos, pos[idx], self.r, batch, batch[idx],
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric-1.6.3-py3.8.egg/torch_geometric/nn/pool/__init__.py", line 173, in radius
    return torch_cluster.radius(x, y, r, batch_x, batch_y, max_num_neighbors,
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/ros2-foxy/.local/lib/python3.8/site-packages/torch_cluster/radius.py", line 53, in radius
    if batch_x is not None:
        assert x.size(0) == batch_x.numel()
        batch_size = int(batch_x.max()) + 1
                     ~~~ <--- HERE

        deg = x.new_zeros(batch_size, dtype=torch.long)
RuntimeError: CUDA error: the launch timed out and was terminated

ros2-foxy@ros2foxy-Lenovo-ideapad-700-15ISK:~/pytorch_geometric$ 

I don't have a dedicated computer for DL at the moment, so I use the minimal batch size. I searched for possible causes but could not figure out why this happens.

I have a few .pcd files and could provide them if you want to reproduce the issue.

Thank you very much for your time.

rusty1s commented 3 years ago

Can you show me what batch_x looks like before the crash?
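
One way to answer this is to print a short summary of the batch vector right before the radius call. The following is a minimal sketch, assuming `batch` is the 1-D long tensor that assigns each point to an example in the mini-batch; `summarize_batch` is a hypothetical debugging helper, not part of PyTorch Geometric or torch_cluster.

```python
import torch


def summarize_batch(pos: torch.Tensor, batch: torch.Tensor) -> dict:
    """Collect basic facts about a batch vector (hypothetical debugging helper)."""
    info = {
        "num_points": pos.size(0),
        "batch_numel": batch.numel(),
        "dtype": str(batch.dtype),
        "device": str(batch.device),
        # torch_cluster.radius asserts this equality before computing batch_size.
        "sizes_match": pos.size(0) == batch.numel(),
    }
    if batch.numel() > 0:
        info["min"] = int(batch.min())
        info["max"] = int(batch.max())
        # Points per example; an entry of 0 means an example with no points.
        info["points_per_example"] = torch.bincount(batch).tolist()
    return info


# Toy stand-in for one mini-batch: 5 points spread over two examples.
pos = torch.rand(5, 3)
batch = torch.tensor([0, 0, 0, 1, 1])
summary = summarize_batch(pos, batch)
print(summary)
```

Calling the helper on the CPU copies of the tensors (for example `summarize_batch(pos.cpu(), batch.cpu())`) avoids launching further CUDA kernels while debugging.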

jediofgever commented 3 years ago

https://colab.research.google.com/drive/1Vc5_yXZguGllK-ZQWgJj-4Us19bAyy_Z?usp=sharing

My Colab file is here.

I am not sure what you mean by batch_x.

It seems that DataLoader can't load the data properly. I can correctly see train_dataset.data.pos and train_dataset.data.y:

train_dataset = UnevenGroundDataset(
    root="/opt/uneven_ground_dataset/", transform=None, pre_transform=None
)

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=1)

The DataLoader doesn't seem to return data properly, so the error complains about an empty tensor when iterating over train_loader, but I cannot see why.

rusty1s commented 3 years ago

There seems to be an example with no points in it. Is that true?

min_nodes = float('inf')
for data in train_dataset:
    min_nodes = min(min_nodes, data.num_nodes)
print(min_nodes)

jediofgever commented 3 years ago

No, that script produces:

Data(pos=[4065298, 3], x=[4065298, 3], y=[4065298])
Intializing UnevenGroundDataset dataset
print(min_nodes): 461234

where pos holds the point locations, x the point normals, and y the label of each point.

rusty1s commented 3 years ago

Can you test if it works to first move the required tensors to the CPU before calling radius here?

row, col = radius(pos.cpu(), pos[idx].cpu(), self.r, batch.cpu(), batch[idx].cpu(),
                  max_num_neighbors=64)
row, col = row.cuda(), col.cuda()

jediofgever commented 3 years ago

The dataset initialization was the problem. I wasn't correctly transferring the points to data.pos as a torch tensor. Now I can run the training process, but the network isn't learning anything in the first 10 epochs. I have around 3-4 million labeled points.
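
The thread does not show the exact fix, so the following is only a sketch of the kind of check that can catch this class of mistake early. It assumes the layout reported above (pos and x as float [N, 3], y as one integer label per point); `to_point_tensors` is a hypothetical helper.

```python
import torch


def to_point_tensors(points, normals, labels):
    """Convert raw arrays or lists to the tensor layout used in this thread.

    Hypothetical helper: pos and x become float32 [N, 3], y becomes int64 [N].
    """
    pos = torch.as_tensor(points, dtype=torch.float32)
    x = torch.as_tensor(normals, dtype=torch.float32)
    y = torch.as_tensor(labels, dtype=torch.long)

    # Fail early on the host instead of deep inside a CUDA kernel.
    assert pos.dim() == 2 and pos.size(1) == 3, f"bad pos shape {tuple(pos.shape)}"
    assert x.shape == pos.shape, "normals must match positions"
    assert y.shape == (pos.size(0),), "need one label per point"
    assert pos.size(0) > 0, "empty point cloud"
    assert torch.isfinite(pos).all(), "pos contains NaN or inf"
    return pos, x, y


# Toy input: two points with normals and labels.
pos, x, y = to_point_tensors(
    points=[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    normals=[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    labels=[0, 1],
)
print(pos.shape, x.shape, y.shape)
```

The resulting tensors could then be passed to `torch_geometric.data.Data(pos=pos, x=x, y=y)` inside the dataset's processing step.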

jediofgever commented 3 years ago

I downsampled the cloud and now get 0.92 accuracy at the last epoch of the training phase. However, when testing (with identical data), the network cannot predict anything. The model is overfitting, but is it normal that I get no predictions at all on identical data?
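
The thread does not say which downsampling method was used. As an illustration only, random subsampling of a labeled cloud can be sketched as follows; `random_downsample` is a hypothetical helper, and the sample count is an arbitrary assumption.

```python
import torch


def random_downsample(pos, x, y, num_samples, seed=0):
    """Keep a random subset of points and their features/labels (hypothetical helper)."""
    n = pos.size(0)
    if n <= num_samples:
        return pos, x, y
    gen = torch.Generator().manual_seed(seed)
    # The same index tensor is applied to pos, x and y so they stay aligned.
    idx = torch.randperm(n, generator=gen)[:num_samples]
    return pos[idx], x[idx], y[idx]


# Toy cloud: 1000 points with normals and binary labels.
pos = torch.rand(1000, 3)
x = torch.rand(1000, 3)
y = torch.randint(0, 2, (1000,))
pos_s, x_s, y_s = random_downsample(pos, x, y, num_samples=100)
print(pos_s.shape, x_s.shape, y_s.shape)
```

Random subsampling does not preserve uniform spatial density; a voxel-grid approach may suit large navigation maps better, at the cost of more code.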

rusty1s commented 3 years ago

Using the training data during inference should also yield 0.92 accuracy. If it does not, there might be differences in the code between the training and inference computation, e.g., induced by BatchNorm or Dropout.
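
To illustrate the BatchNorm/Dropout point, here is a minimal sketch using a toy network, not the PointNet++ model from this issue. It shows the usual pattern of calling `model.eval()` and disabling gradients before inference; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a per-point classifier; not the model from the thread.
model = nn.Sequential(
    nn.Linear(3, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 2),
)
points = torch.rand(8, 3)

# Training mode: BatchNorm uses batch statistics and Dropout samples a new
# random mask on every call, so repeated calls usually give different outputs.
model.train()
out_train_a = model(points)
out_train_b = model(points)

# Eval mode: BatchNorm uses its running statistics and Dropout is a no-op,
# so repeated calls on the same input give identical outputs.
model.eval()
with torch.no_grad():
    out_eval_a = model(points)
    out_eval_b = model(points)

print(torch.equal(out_eval_a, out_eval_b))
```

If the test loop skips `model.eval()`, or applies different preprocessing than the training loop, predictions on identical data can diverge from the training accuracy.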