IndexError in MetaPath2Vec

HughBlayney commented 3 years ago

🐛 Bug

Hi,

I'm getting an IndexError when training MetaPath2Vec on my own dataset. The stack trace is:

IndexError: Caught IndexError in DataLoader worker process 4.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/GNN2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/GNN2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/anaconda3/envs/GNN2/lib/python3.7/site-packages/torch_geometric/nn/models/metapath2vec.py", line 157, in sample
    return self.pos_sample(batch), self.neg_sample(batch)
  File "/home/ubuntu/anaconda3/envs/GNN2/lib/python3.7/site-packages/torch_geometric/nn/models/metapath2vec.py", line 123, in pos_sample
    batch = adj.sample(num_neighbors=1, subset=batch).squeeze()
  File "/home/ubuntu/anaconda3/envs/GNN2/lib/python3.7/site-packages/torch_sparse/sample.py", line 22, in sample
    return col[rand]
IndexError: index 1549811 is out of bounds for dimension 0 with size 1549811

From what I understand, the final entry of the rowptr tensor in sample is being used as an index into the col tensor, which is out of bounds because that entry equals the length of col. However, this doesn't happen on the default AMiner dataset, even though the subset tensor there is also drawn from a larger tensor whose maximum value would index the final entry of rowptr. I therefore think I'm misunderstanding part of the code, so any help would be very much appreciated.
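The indexing described above can be illustrated with a small pure-Python sketch. It is not the torch_sparse implementation: it assumes the sampler picks `col[rowptr[v] + floor(u * degree(v))]` for a uniform `u` in [0, 1), which is consistent with the `return col[rand]` line in the trace. The names `build_csr` and `sample_one_neighbor` are hypothetical. Under that assumption, a node with no outgoing edges gets offset 0, so the lookup lands on `col[rowptr[v]]`. For a zero-degree node in the middle of the id range, that silently returns another node's neighbor. For zero-degree nodes at the end of the id range, `rowptr[v]` equals `len(col)` and the lookup fails. This could explain why some datasets never raise.

```python
import random


def build_csr(num_nodes, edges):
    """Build (rowptr, col) in CSR form from a list of (src, dst) pairs."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    rowptr = [0]
    for c in counts:
        rowptr.append(rowptr[-1] + c)
    # Sorting by source groups each node's neighbors contiguously.
    col = [dst for _, dst in sorted(edges)]
    return rowptr, col


def sample_one_neighbor(rowptr, col, node, rng=random):
    """Assumed sampling rule: index = rowptr[node] + floor(u * degree)."""
    degree = rowptr[node + 1] - rowptr[node]
    offset = int(rng.random() * degree)  # always 0 when degree == 0
    return col[rowptr[node] + offset]


# Four nodes; nodes 1 and 3 have no outgoing edges.
rowptr, col = build_csr(4, [(0, 1), (0, 2), (2, 0)])

# Node 1 (zero-degree, mid-range): no error, but the result is really
# the first neighbor of node 2.
wrong_neighbor = sample_one_neighbor(rowptr, col, 1)

# Node 3 (zero-degree, last id): rowptr[3] == len(col), so indexing fails.
try:
    sample_one_neighbor(rowptr, col, 3)
    failed = False
except IndexError:
    failed = True
```

In this toy graph `rowptr[3]` and `len(col)` are both 3, mirroring the "index 1549811 ... with size 1549811" message above.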

Reproducing the behaviour is complicated because I can't get the error to occur on the AMiner dataset, and I'm unable to share the dataset I'm working with. If it would be helpful for me to report back any metrics, or the results of any functions on my dataset, please let me know and I'll do what I can.

Thank you very much for your time, and for putting together such a fantastic library!

Environment

OS: Ubuntu 18.04.5
Python version: 3.7.10
PyTorch version: 1.7.1+cu101
CUDA/cuDNN version: 10.1, V10.1.243
GCC version: 7.5.0

rusty1s commented 3 years ago

This seems to be an issue with isolated nodes. In particular, you may want to pass the num_nodes_dict argument to the MetaPath2Vec model.
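As a rough illustration of why `num_nodes_dict` can matter: if node counts are inferred from the edge lists as "largest id plus one" (an assumption about the default behaviour), trailing isolated nodes are undercounted. The sketch below uses plain Python lists instead of tensors, and the helper name `infer_num_nodes` and the node and edge type names are made up for the example.

```python
def infer_num_nodes(edge_index_dict):
    """Infer per-type node counts as (largest node id seen) + 1."""
    counts = {}
    for (src_type, _, dst_type), (src_ids, dst_ids) in edge_index_dict.items():
        for node_type, ids in ((src_type, src_ids), (dst_type, dst_ids)):
            if ids:
                counts[node_type] = max(counts.get(node_type, 0), max(ids) + 1)
    return counts


# Hypothetical graph: 4 authors exist, but authors 2 and 3 have no edges.
edge_index_dict = {
    ("author", "writes", "paper"): ([0, 1], [0, 2]),
}
inferred = infer_num_nodes(edge_index_dict)

# Explicit counts known from the dataset itself.
num_nodes_dict = {"author": 4, "paper": 3}

# Node types whose inferred count is too small.
undercounted = {
    t: (inferred.get(t, 0), n)
    for t, n in num_nodes_dict.items()
    if inferred.get(t, 0) < n
}

# The explicit dict would then be passed as a keyword argument, e.g.
# MetaPath2Vec(edge_index_dict, ..., num_nodes_dict=num_nodes_dict)
```

This only corrects the node counts; it does not by itself give isolated nodes any neighbors to sample from.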

ruzihao commented 3 years ago

> This seems to be an issue with isolated nodes. In particular, you may want to pass the num_nodes_dict argument to the MetaPath2Vec model.

I had a similar issue, and I agree that it looks like an issue with isolated nodes. However, setting the num_nodes_dict argument does not solve the problem. Does anyone have a better idea?

rusty1s commented 3 years ago

Do you have a small example to reproduce?

xuyxu commented 2 years ago

Hi @rusty1s, I have created a HeteroData object with the following statistics:

HeteroData(
  (node_type_A, relation_A, node_type_A)={ edge_index=[2, 9000000] },
  (node_type_A, relation_B, node_type_A)={ edge_index=[2, 18000000] }
)

After passing this graph into a metapath2vec model, it correctly identifies the number of nodes: model.num_nodes_dict={'node_type_A': 5000000}. However, training crashes and reports the same IndexError.

I am sure that some nodes that appear in relation_B do not have any relation_A edges. Is this what causes the IndexError? metapath2vec works fine if the metapath passed to the model uses only one of the two relations.

rusty1s commented 2 years ago

I think MetaPath2Vec should be able to handle nodes with zero outgoing edges. Any chance you have a small example to reproduce?

Amayama commented 2 years ago

Hi @rusty1s, a similar problem appeared when I tested with my dataset. I have built a toy project that should help you reproduce and understand the problem. The project is at https://github.com/Amayama/pyg_error_toy. Thanks for your help!

rusty1s commented 2 years ago

Thank you. This helps a lot. The issue is that your graph contains isolated nodes, so that random walk generation fails. I'm not yet sure how to fix this without introducing a lot of computational overhead, but I'm looking into it. In particular, in your example, most nodes are isolated, and as a result, random-walk based learning methods cannot give you meaningful embeddings in the first place.
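One cheap way to keep walk generation from failing on such nodes is to stop at a sentinel when a node has no outgoing edges. The sketch below illustrates that idea in pure Python on a hand-written CSR structure. It is an assumption made for illustration, not necessarily what the library does, and `DEAD_END` and `walk` are hypothetical names.

```python
import random

DEAD_END = -1  # hypothetical sentinel marking "the walk could not continue"


def walk(rowptr, col, start, length, rng=random):
    """Random walk of `length` steps; pads with DEAD_END after a dead end."""
    path = [start]
    node = start
    for _ in range(length):
        if node != DEAD_END:
            degree = rowptr[node + 1] - rowptr[node]
            if degree == 0:
                node = DEAD_END  # no neighbor to sample from
            else:
                node = col[rowptr[node] + rng.randrange(degree)]
        path.append(node)
    return path


# CSR for edges 0->1, 0->2, 2->0 over four nodes (1 and 3 are dead ends).
rowptr = [0, 2, 2, 3, 3]
col = [1, 2, 0]

from_isolated = walk(rowptr, col, start=3, length=3)
from_connected = walk(rowptr, col, start=2, length=2)
```

The extra cost is one degree check per step. Sentinel entries would still need to be masked out when the walks are turned into training pairs.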

xuyxu commented 2 years ago

Currently, torch_geometric.transforms.remove_isolated_nodes cannot properly handle heterogeneous graphs, right?

rusty1s commented 2 years ago

Sadly not yet, and it does not really resolve this issue, as there might be nodes that are only isolated for a few edge types, while they are connected to some nodes for other edge types. I'm trying to fix this directly in MetaPath2Vec.
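To check whether a dataset is affected, one can list, for each edge type in the metapath, the source nodes that have no outgoing edge of that type. The sketch below does this with plain Python lists. The helper name `dead_ends_per_edge_type` is hypothetical, and the small graph only mimics the two-relation setup described earlier in the thread.

```python
def dead_ends_per_edge_type(edge_index_dict, num_nodes_dict, metapath):
    """For each edge type, list source-type nodes with no outgoing edge."""
    report = {}
    for edge_type in metapath:
        src_type = edge_type[0]
        src_ids, _ = edge_index_dict[edge_type]
        has_outgoing = set(src_ids)
        report[edge_type] = [
            v for v in range(num_nodes_dict[src_type]) if v not in has_outgoing
        ]
    return report


# Toy graph: one node type, two relations over five nodes.
edge_index_dict = {
    ("node_type_A", "relation_A", "node_type_A"): ([0, 1], [1, 0]),
    ("node_type_A", "relation_B", "node_type_A"): ([0, 2, 3], [2, 3, 0]),
}
num_nodes_dict = {"node_type_A": 5}
metapath = list(edge_index_dict.keys())

report = dead_ends_per_edge_type(edge_index_dict, num_nodes_dict, metapath)
```

Here nodes 2 and 3 are connected through relation_B but have no relation_A edges, and node 4 is isolated for both relations, so removing globally isolated nodes would only take care of node 4.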

rusty1s commented 2 years ago

Should be fixed when installing from master, see https://github.com/pyg-team/pytorch_geometric/pull/3353. Closing this issue now. Feel free to re-open it in case you meet any issues.

pyg-team / pytorch_geometric

IndexError in MetaPath2Vec #2273
