snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Why does randomly dropping some edges avoid a segmentation fault? #425

Closed LukeLIN-web closed 1 year ago

LukeLIN-web commented 1 year ago

https://github.com/snap-stanford/ogb/blob/a47b716f7e972f666eae9909ee0f922cd0f9d966/examples/nodeproppred/papers100M/node2vec.py#L57

I ran into some problems when I tried to run GraphSAGE on the papers100M dataset. Could anybody give me some advice?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch_geometric/nn/conv/message_passing.py", line 239, in __lift__
    return src.index_select(self.node_dim, index)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "quiver_benchmark.py", line 82, in train
    out = model(x[n_id], adjs)
ValueError: Encountered a CUDA error. Please ensure that all indices in 'edge_index' point to valid indices in the interval [0, 951644) in your node feature matrix and try again.

  File "/opt/conda/lib/python3.8/site-packages/torch_geometric/nn/aggr/base.py", line 126, in __call__
    if index.numel() > 0 and dim_size <= int(index.max()):
RuntimeError: CUDA error: an illegal memory access was encountered
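
One way to localize the real failure is to rerun with CUDA_LAUNCH_BLOCKING=1, as the message suggests, or to reproduce the bounds check on CPU before any kernel launches. A debugging sketch, assuming the (batch_size, n_id, adjs) loader layout used in the training loop later in this thread:

    for batch_size, n_id, adjs in train_loader:
        num_nodes = n_id.size(0)  # rows available in the gathered features x[n_id]
        for edge_index, _, size in adjs:
            # every index must fall inside [0, num_nodes), matching the
            # interval reported in the ValueError above
            assert int(edge_index.max()) < num_nodes, (
                f"edge_index references node {int(edge_index.max())}, "
                f"but only {num_nodes} node features were gathered")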
weihua916 commented 1 year ago

Most likely because PyTorch did not support tensors of such a large size. We needed to drop some elements so that PyTorch ran fine. I am not sure whether the edge drop is still needed in the latest PyTorch, so it may be worth trying without the hack.
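
The hack in question looks roughly like this (a minimal sketch; keep_prob is illustrative, not the fraction used in the linked node2vec.py):

    import torch

    def drop_edges(edge_index: torch.Tensor, keep_prob: float = 0.5) -> torch.Tensor:
        # keep each edge independently with probability keep_prob, so the
        # resulting edge_index stays below the tensor size PyTorch could
        # handle at the time
        mask = torch.rand(edge_index.size(1)) < keep_prob
        return edge_index[:, mask]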

Also, you are pointing to the node2vec code. Can you point us to the GraphSAGE code you used?

LukeLIN-web commented 1 year ago

> Most likely because PyTorch did not support tensors of such a large size. We needed to drop some elements so that PyTorch ran fine. I am not sure whether the edge drop is still needed in the latest PyTorch, so it may be worth trying without the hack.
>
> Also, you are pointing to the node2vec code. Can you point us to the GraphSAGE code you used?

It seems that the neighbor sampler avoids the large-tensor problem, since each batch only touches a sampled subgraph. But I ran into another problem.
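
For context, the model below consumes batches from a layer-wise NeighborSampler along these lines (a sketch; sizes, batch_size, and train_idx are illustrative):

    from torch_geometric.loader import NeighborSampler

    # each yielded batch is (batch_size, n_id, adjs): n_id holds the global ids
    # of all sampled nodes, and adjs holds one (edge_index, e_id, size) triple
    # per layer, ordered from the outermost hop inwards
    train_loader = NeighborSampler(data.edge_index, node_idx=train_idx,
                                   sizes=[15, 10], batch_size=1024,
                                   shuffle=True)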

import torch
import torch.nn.functional as F
from torch import Tensor
from torch_geometric.nn import SAGEConv

class SAGE(torch.nn.Module):

    def __init__(self,
                 in_channels,
                 hidden_channels,
                 out_channels,
                 num_layers=2):
        super(SAGE, self).__init__()
        self.num_layers = num_layers

        self.convs = torch.nn.ModuleList()
        self.convs.append(SAGEConv(in_channels, hidden_channels))
        for _ in range(self.num_layers - 2):
            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
        self.convs.append(SAGEConv(hidden_channels, out_channels))

    def forward(self, x: Tensor, adjs: list) -> Tensor:
        for i, (edge_index, _, size) in enumerate(adjs):
            x_target = x[:size[1]]  # Target nodes are always placed first.
            x = self.convs[i]((x, x_target), edge_index)
            if i != self.num_layers - 1:
                x = F.relu(x)
                # x = F.dropout(x, p=0.5, training=self.training)
        return x.log_softmax(dim=-1)

    @torch.no_grad()
    def inference(self, x_all, device, subgraph_loader):
        for i in range(self.num_layers):
            xs = []
            for batch_size, n_id, adj in subgraph_loader:
                edge_index, _, size = adj.to(device)
                x = x_all[n_id].to(device)
                x_target = x[:size[1]]
                x = self.convs[i]((x, x_target), edge_index)
                if i != self.num_layers - 1:
                    x = F.relu(x)
                xs.append(x)

            x_all = torch.cat(xs, dim=0)

        return x_all
...

    y = data.y.to(rank)
    x = data.x
    # mini-batch loop over the neighbor sampler
    for batch_size, n_id, adjs in train_loader:
        target_node = n_id[:batch_size]  # target nodes are placed first in n_id
        adjs = [adj.to(rank) for adj in adjs]
        out = model(x[n_id].to(rank), adjs)
        loss = F.nll_loss(out, y[target_node].squeeze(1))

Traceback (most recent call last):
  File "paper100m.py", line 75, in train
    loss = criterion(
  File "/root/share/gnnproject/microGNN/models/criterion.py", line 12, in criterion
    loss = F.nll_loss(logits, labels.squeeze(1))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2671, in nll_loss
    return torch._C._nn.nll_loss_nd(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Float'

rusty1s commented 1 year ago

What's the shape of out and y going into nll_loss?

LukeLIN-web commented 1 year ago

> What's the shape of out and y going into nll_loss?

Thank you for your reply!

out: torch.Size([1024, 172]), target: torch.Size([1024, 1])

https://github.com/snap-stanford/ogb/pull/427#issuecomment-1501121794

rusty1s commented 1 year ago

Can you try to make y a LongTensor?
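
For reference, nll_loss requires integer class indices as targets, and the papers100M labels load as floats, hence the "not implemented for 'Float'" error above. A sketch of the cast against the training snippet earlier in the thread:

    # cast once up front: nll_loss needs class indices of dtype long
    y = data.y.to(rank).long()
    loss = F.nll_loss(out, y[target_node].squeeze(1))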

LukeLIN-web commented 1 year ago

> Can you try to make y a LongTensor?

Thank you. It works.