Performance issues with ogbl-biokg graph in DeepSNAP

snap-stanford / deepsnap

Python library assists deep learning on graphs

MIT License


Hello,

I am trying to use the ogbl-biokg dataset (docs | github) with the DeepSNAP package. The graph has 5,088,434 edges and 93,773 nodes. I created a custom dataset (link to the code), but I am running into massive performance issues.

The problem is that processing the graph and generating the HeteroGraph object takes more than 30 minutes:

hetero = HeteroGraph(G)

Memory consumption is also too high once I start training, even on a node with 256 GB of RAM, so the job always crashes. I am using the graph for link prediction with the heterogeneous GraphSAGE model (tutorial colab from DeepSNAP). I think the problem might be the use of networkx in the backend. I tried loading the graph with the StellarGraph package via numpy arrays, which are much more efficient: the whole graph loads within a minute, even on a CPU.
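To put the numpy comparison in perspective, here is a rough back-of-the-envelope sketch of what the edge list alone costs when stored as a dense integer array. The helper name `edge_list_bytes` and the choice of 8-byte (int64) indices are assumptions for illustration; the estimate ignores relation types, node features, and any negative edges sampled later.

```python
def edge_list_bytes(num_edges: int, bytes_per_index: int = 8) -> int:
    """Bytes needed for a 2 x num_edges integer edge index (source row + target row)."""
    return 2 * num_edges * bytes_per_index


NUM_EDGES = 5_088_434  # edge count quoted in this post

size = edge_list_bytes(NUM_EDGES)
print(f"{size / 1e6:.1f} MB")  # prints "81.4 MB"
```

So the raw connectivity fits in well under 100 MB, which suggests the memory blow-up comes from how the graph is represented and processed rather than from the size of the data itself.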

Do you have any suggestions for loading the data into DeepSNAP more efficiently? Or could you integrate the ogbl-biokg graph as a dataset in your library, given that the ogb package is also part of snap-stanford? This would be very helpful!

Hi,

Thanks for pointing this out. You are right that handling graph data through a NetworkX graph object is not efficient. However, the slowdown when generating the HeteroGraph is probably caused mainly by what DeepSNAP does internally: it transforms the NetworkX graph into tensors, and in the link prediction case it also splits off multiple negative edges. These steps can cause both the performance and the memory issues.

One potential solution is to skip DeepSNAP if you don't need to manipulate the graph heavily (for example, during training). You can use PyG directly with its transform functions. The heterogeneous functionality has also been merged into PyG, so you can use it from there directly.

If you do have heavy graph manipulation requirements and need the graph algorithms from NetworkX, you can try feeding tensors directly, such as in this example, but I am not sure whether this works for the link prediction task, and I am also unsure about its performance. I will benchmark this and try to track down the performance issue when I have time.

Thanks.

Performance issues with ogbl-biokg graph in DeepSNAP #40