pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Segmentation fault (core dumped) #1362

Open crea397 opened 4 years ago

crea397 commented 4 years ago

❓ Questions & Help

Hi,

I implemented my program based on examples/pointnet2_classification.py and trained the model on a Google Colaboratory GPU.

I saved the model trained in Colab and tried to load it on a Jetson Xavier NX for inference, but I get Segmentation fault (core dumped).

I ran the same code on the Jetson Nano, but in that case Segmentation fault (core dumped) did not occur.

I think I'm getting Segmentation fault (core dumped) when I run test().

How do I solve this problem?

Thanks!

Environment

$ uname -a
Linux Jetson Xavier NX 4.9.140-tegra #1 SMP PREEMPT Wed Apr 8 18:15:20 PDT 2020 aarch64 aarch64 aarch64 GNU/Linux

Device              Nano          Xavier NX
OS                  Ubuntu 18.04  Ubuntu 18.04
Python              3.6.9         3.6.9
torch               1.4.0         1.4.0a0+7f73f1d
torch.version.cuda  10.0          10.2
nvcc -V             10.0          10.2
torch-scatter       2.0.4         2.0.4
torch-sparse        0.6.1         0.6.1
torch-cluster       1.5.4         1.5.4
torch-spline-conv   1.2.0         1.2.0
torch-geometric     1.4.3         1.4.3

$ pointnet++.py
load model
Test Acc: 0.9238
Segmentation fault (core dumped)

rusty1s commented 4 years ago

Do you know where the segmentation fault occurs?

crea397 commented 4 years ago

I think I'm getting Segmentation fault (core dumped) when I run pred = model(data).

rusty1s commented 4 years ago

Yes, but do you know which operation in model(data) produces this error?
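One way to narrow down the failing operation is Python's standard faulthandler module, which dumps the Python traceback when the process receives SIGSEGV, combined with flushed progress markers around each step. The following is a minimal sketch, not something taken from this thread: the log file name and the checkpoint helper are hypothetical names chosen for illustration.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
trace_path = os.path.join(tempfile.gettempdir(), "segfault_trace.log")

# Keep the file open for the life of the process: faulthandler writes to its
# file descriptor from inside the signal handler.
trace_log = open(trace_path, "w")
faulthandler.enable(file=trace_log, all_threads=True)
print(f"crash tracebacks will be written to {trace_path}", flush=True)


def checkpoint(label):
    # flush=True matters: buffered output is lost when the process segfaults.
    print(f"[trace] reached {label}", flush=True)
    return label


# Hypothetical usage inside a forward pass:
#   checkpoint("before conv"); x = self.conv(...); checkpoint("after conv")
checkpoint("startup")
```

The same handler can be enabled without code changes by starting the interpreter with python -X faulthandler, in which case the traceback goes to stderr. The last checkpoint printed before the crash brackets the operation to inspect.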

crea397 commented 4 years ago

@rusty1s I think the following line is causing the segmentation fault (pointnet2_classification.py, line 24, in SAModule):

x = self.conv(x, (pos, pos[idx]), edge_index)

When I comment this line out, the segmentation fault does not occur; when I uncomment it, the segmentation fault returns.

rusty1s commented 4 years ago

Does that also happen when running on CPU? What are the shapes of pos and pos[idx], and what are the outputs of edge_index[0].max() and edge_index[1].max()?

crea397 commented 4 years ago

I used device = torch.device('cpu') instead of device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'). With that change, the segmentation fault did not occur.

I checked the output for the same point cloud data on the Nano and the Xavier NX.

Device               Nano                         Xavier NX
pos                  torch.Size([100, 3])         torch.Size([100, 3])
pos[idx]             torch.Size([50, 3])          torch.Size([50, 3])
edge_index[0].max()  tensor(99, device='cuda:0')  tensor(99, device='cuda:0')
edge_index[1].max()  tensor(49, device='cuda:0')  tensor(49, device='cuda:0')

rusty1s commented 4 years ago

Can you do me a favor and test if scatter_max works on GPU on Xavier NX?

crea397 commented 4 years ago

I ran the following code, adapted from torch-scatter.

Code

import torch
from torch_scatter import scatter_max

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
src = torch.tensor([[2, 0, 1, 4, 3], [0, 2, 1, 3, 4]], device=device)
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]], device=device)

out, argmax = scatter_max(src, index, dim=-1)

print(out)
print(argmax)

Result

tensor([[0, 0, 4, 3, 2, 0],
        [2, 4, 3, 0, 0, 0]], device='cuda:0')
tensor([[5, 5, 3, 4, 0, 1],
        [1, 4, 3, 5, 5, 5]], device='cuda:0')
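For comparison, here is a pure-Python reference of what scatter_max computes along the last dimension for this input. It is only a sketch for checking the printed result, not the torch-scatter implementation: the function name scatter_max_rows is hypothetical, and the fill conventions (0 for empty output slots, the source row length for empty argmax slots) are assumptions read off the output above.

```python
def scatter_max_rows(src, index):
    # The output width is shared by all rows: largest index value plus one.
    dim_size = max(max(row) for row in index) + 1
    out_rows, arg_rows = [], []
    for src_row, idx_row in zip(src, index):
        out = [None] * dim_size
        # Assumed convention: slots that receive nothing point past the row.
        arg = [len(src_row)] * dim_size
        for pos, (value, slot) in enumerate(zip(src_row, idx_row)):
            # Strict comparison keeps the first position on ties.
            if out[slot] is None or value > out[slot]:
                out[slot], arg[slot] = value, pos
        # Assumed convention: slots that receive nothing are filled with 0.
        out_rows.append([0 if v is None else v for v in out])
        arg_rows.append(arg)
    return out_rows, arg_rows


src = [[2, 0, 1, 4, 3], [0, 2, 1, 3, 4]]
index = [[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]]
out, argmax = scatter_max_rows(src, index)
print(out)     # [[0, 0, 4, 3, 2, 0], [2, 4, 3, 0, 0, 0]]
print(argmax)  # [[5, 5, 3, 4, 0, 1], [1, 4, 3, 5, 5, 5]]
```

The reference values match the tensors printed on the Xavier NX, which suggests scatter_max itself returns correct results on the GPU for this small input.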

Dave0995 commented 2 years ago

I got the same error when trying to run inference on a Jetson AGX Xavier. The line that triggers the segmentation fault (core dumped) is:

self.cfx = cuda.Device(0).make_context()

I'm trying to run inference with NVIDIA TensorRT 7.1. It's strange: when I use the optimized engine in a single standalone script, it works, but when I use gRPC to create a microservice, the segmentation fault (core dumped) occurs.

rusty1s commented 2 years ago

Can you do me a favor and test by re-installing torch-scatter and torch-sparse from the latest released wheels (uploaded yesterday)? They include some changes that add support for a larger variety of compute capabilities.

KCSesh commented 2 years ago

@Dave0995 Was any progress ever made on this?

I'm still seeing this exact issue.

self.cfx = cuda.Device(0).make_context() works in a standalone script.

But it throws a segfault when another process is introduced.
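One possible cause, which is an assumption and not confirmed anywhere in this thread, is GPU state being shared with a child process created by fork; the CUDA runtime does not support being used in a forked child after it was initialized in the parent. A minimal sketch of the usual workaround is to start workers with the "spawn" method and create the context inside the child. The names gpu_worker and run_in_spawned_process are hypothetical, the worker body is a placeholder, and the mp_context argument requires Python 3.7 or later.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


def gpu_worker(device_index):
    # Hypothetical worker: a real one would create its CUDA context or load
    # its engine here, inside the child, instead of inheriting it from the
    # parent. This placeholder only reports where it ran.
    return device_index, os.getpid()


def run_in_spawned_process(fn, *args):
    # "spawn" starts a fresh interpreter rather than forking the parent,
    # so no GPU state from the parent is carried into the child.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as pool:
        return pool.submit(fn, *args).result()


if __name__ == "__main__":
    device_index, child_pid = run_in_spawned_process(gpu_worker, 0)
    print(f"device {device_index} handled in pid {child_pid}, "
          f"parent pid {os.getpid()}")
```

If the service handles requests in threads rather than processes, this sketch does not apply as written, and the context handling per thread would need to be checked instead.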