loader.ClusterData (Metis) crashes for directed graphs

pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch

https://pyg.org

MIT License


loader.ClusterData (Metis) crashes for directed graphs #4893

Open MartinSchmitz95 opened 2 years ago

MartinSchmitz95 commented 2 years ago

🐛 Describe the bug

Hello, I am trying to divide a directed graph into clusters using Metis:

loader.ClusterData(graph, num_parts=100, recursive=False)

ClusterData works as long as 'num_parts' is very small (<20). As soon as I choose a higher value like 100, it crashes with a segmentation fault. When I convert my graphs into undirected graphs it works without problems, but I would like to keep the directed graph.

Is there a fix for this Metis problem? Or is there a workaround, maybe to reconstruct the directed graph after the Metis partitioning?

Environment

PyG version: 2.0.5
PyTorch version: 1.10.1
OS: Ubuntu
Python version: 3.9
CUDA/cuDNN version: Both CPU and Cuda 10.3

rusty1s commented 2 years ago

We cannot control the Metis execution, and as far as I know it expects an undirected graph as input. As such, I recommend that you pass in an undirected graph, collect the partitions and node ids, and then apply them to your directed graph via data.subgraph(). WDYT?

MartinSchmitz95 commented 2 years ago

Thank you very much. My code looks like this now and it seems to work.

Set ids in graph manually:

graph.node_ids = torch.arange(graph.num_nodes)

Transform graph to undirected:

transform = T.ToUndirected()
undir_graph = transform(graph)

Run Metis on the undirected graph:

train_cluster_data = loader.ClusterData(undir_graph, num_parts=num_clusters, recursive=False, save_dir='../data/cache')
train_loader = loader.ClusterLoader(train_cluster_data, batch_size=batch_size, shuffle=True)

Take the node ids of each Metis partition and create a subgraph of the original directed graph from them:

for data in train_loader:
    data = graph.subgraph(data.node_ids)

I am not sure if my manual id setting with torch.arange works as intended, though.
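One way to sanity-check the manual ids, sketched here without PyG: if every node is given its own position as an id before partitioning, the ids collected from all batches in one pass should cover each node exactly once. The function name `ids_form_partition` and the toy batches are hypothetical and only illustrate the check.

```python
from collections import Counter


def ids_form_partition(batches, num_nodes):
    """Return True if the id lists in `batches` cover 0..num_nodes-1 exactly once."""
    counts = Counter(i for batch in batches for i in batch)
    covers_all = set(counts) == set(range(num_nodes))
    no_repeats = all(c == 1 for c in counts.values())
    return covers_all and no_repeats


# Toy stand-ins for the `data.node_ids` values yielded by the loader.
good = [[0, 3, 4], [1, 2, 5]]   # original ids were carried through
bad = [[0, 1, 2], [0, 1, 2]]    # ids were reset per batch (hypothetical failure)

print(ids_form_partition(good, 6))
print(ids_form_partition(bad, 6))
```

With the real loader, `batches` could presumably be built as `[d.node_ids.tolist() for d in train_loader]`, assuming `node_ids` is sliced along with the other node-level attributes.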

rusty1s commented 2 years ago

I think this looks correct. Does it work? :)

MartinSchmitz95 commented 2 years ago

It works, I think we can close this thread. Thanks a lot for your help :)

Just one thing I also want to mention: I have edge features in my graph, stored as data.e. After using the subgraph function as shown, the edge feature matrix stays the same. In order to retrieve only the edge features of the subgraph, I have to take: data.e[data.edge_index][0]

rusty1s commented 2 years ago

This shouldn't be the case. subgraph should be able to handle both node and edge features. Can you show me an example?
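For reference, the behaviour expected here can be sketched without PyG: a directed edge is kept only if both endpoints are in the node subset, the kept endpoints are relabelled to local indices, and the same selection is applied to the per-edge features so they stay aligned. This is a library-free illustration under those assumptions, not PyG's implementation; `induced_subgraph` is a hypothetical helper.

```python
def induced_subgraph(edges, edge_feats, subset):
    """Keep directed edges with both endpoints in `subset`, plus their features."""
    keep = set(subset)
    # Map original node ids to positions inside the subset.
    relabel = {old: new for new, old in enumerate(subset)}
    sub_edges, sub_feats = [], []
    for (src, dst), feat in zip(edges, edge_feats):
        if src in keep and dst in keep:
            sub_edges.append((relabel[src], relabel[dst]))
            sub_feats.append(feat)
    return sub_edges, sub_feats


edges = [(0, 1), (1, 2), (2, 0), (3, 1)]   # directed toy graph
feats = ["a", "b", "c", "d"]               # one feature per edge
sub_edges, sub_feats = induced_subgraph(edges, feats, subset=[1, 2, 3])
print(sub_edges, sub_feats)
```

If the feature list comes back with its original length, as described in the comment above, the edge selection was not applied to it.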

MartinSchmitz95 commented 2 years ago

Hmm I tried to replicate the problem on my local machine, and there the subgraph works perfectly fine. It only behaves like this when I run it on my server. It could be related to CUDA.

rusty1s commented 2 years ago

Interesting. Let me know if you can share your data or have some additional pointers on where the error might come from. You can also test your script with the environment variable `CUDA_LAUNCH_BLOCKING=1` for better error messages.
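A minimal sketch of setting that variable from inside a script, assuming it has to be in place before CUDA is initialised (hence before `torch` is imported). The shell equivalent would be `CUDA_LAUNCH_BLOCKING=1 python script.py`, where `script.py` is a placeholder name.

```python
import os

# Set this before CUDA is initialised, i.e. before importing torch.
# Kernel launches then run synchronously, so an error is reported at the
# call that caused it rather than at some later, unrelated call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# `import torch` and the rest of the script would follow here.
```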

willleeney commented 1 year ago

@rusty1s This code replicates the issue. If you set the argument force_undirected=True, then the error no longer occurs. The error occurs in the metis.py file in torch_sparse on line 67:

cluster = torch.ops.torch_sparse.partition(rowptr, col, value, num_parts, recursive)

from torch_geometric.datasets.amazon import Amazon
from torch_geometric.loader import ClusterData, ClusterLoader
from torch_geometric.data import Data
import torch
from torch_geometric.utils import dropout_edge

def set_random(random_seed: int):
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)
    return

set_random(42)
dataset_name = 'Computers'
n_clusters = 10

data = Amazon(root=f'data/{dataset_name}', name=dataset_name)[0]
cluster_data = ClusterData(data, num_parts=n_clusters)
train_loader = ClusterLoader(cluster_data, batch_size=1, shuffle=False)

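for i, batch in enumerate(train_loader):
    # Drop edges without keeping reverse edges in sync, so the result is directed.
    train_edge_index, train_edge_mask = dropout_edge(batch.edge_index, p=0.7, force_undirected=False)
    split_data = Data(x=batch.x, y=batch.y, edge_index=train_edge_index)
    # Partitioning the now-directed subgraph is where the crash occurs.
    cluster_data = ClusterData(split_data, num_parts=n_clusters)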
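
The reproduction above suggests the crash is tied to edge lists that are no longer symmetric. A library-free sketch of a guard that could run before partitioning: list the edges whose reverse is missing, and symmetrize if there are any. Both helper names are hypothetical; in PyG itself, `torch_geometric.utils.is_undirected` and `torch_geometric.utils.to_undirected` serve this purpose.

```python
def missing_reverse_edges(edges):
    """Return the directed edges (u, v) whose reverse (v, u) is absent."""
    edge_set = set(edges)
    return sorted((u, v) for (u, v) in edge_set if (v, u) not in edge_set)


def symmetrize(edges):
    """Return the edge list with every missing reverse edge added."""
    edge_set = set(edges)
    return sorted(edge_set | {(v, u) for (u, v) in edge_set})


directed = [(0, 1), (1, 0), (1, 2)]
print(missing_reverse_edges(directed))              # edges that make the graph directed
print(missing_reverse_edges(symmetrize(directed)))  # empty once symmetrized
```

The partition would then be computed on the symmetrized edges and applied back to the original directed graph, as discussed earlier in the thread.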