snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Code2 example broken #260

Closed (shyam196 closed this 2 years ago)

shyam196 commented 2 years ago

If you try running the code2 example, you get an out-of-bounds error. It runs fine on my laptop, which has an older copy of the code2 dataset downloaded several months ago, but a fresh download does not work.

Here's the error:

$ python main_pyg.py
Namespace(batch_size=128, dataset='ogbg-code2', device=0, drop_ratio=0, emb_dim=300, epochs=25, filename='', gnn='gcn-virtual', max_seq_len=5, num_layer=5, num_vocab=5000, num_workers=0, random_split=False)
Target seqence less or equal to 5 is 0.0%.
Traceback (most recent call last):
  File "main_pyg.py", line 270, in <module>
    main()
  File "main_pyg.py", line 176, in main
    vocab2idx, idx2vocab = get_vocab_mapping([dataset.data.y[i] for i in split_idx['train']], args.num_vocab)
  File "main_pyg.py", line 176, in <listcomp>
    vocab2idx, idx2vocab = get_vocab_mapping([dataset.data.y[i] for i in split_idx['train']], args.num_vocab)
IndexError: list index out of range

dataset.data.y has length 1, which is why the error is thrown.
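
The mismatch described above (a label list shorter than the split indices expect) could be surfaced with a clearer message by a small guard before the vocabulary is built. This is only a minimal sketch: the helper name `check_labels_cover_indices` is hypothetical and not part of OGB or PyG, and the toy values stand in for `dataset.data.y` and `split_idx['train']`.

```python
from typing import Sequence


def check_labels_cover_indices(labels: Sequence, indices: Sequence[int]) -> None:
    """Raise a descriptive error if any split index falls outside `labels`.

    Hypothetical helper for illustration; not part of OGB or PyG.
    """
    if len(indices) == 0:
        return
    largest = max(int(i) for i in indices)
    if largest >= len(labels):
        raise ValueError(
            f"labels has {len(labels)} entries but the split refers to index "
            f"{largest}; the processed dataset may be stale or mis-collated"
        )


# Toy stand-ins for dataset.data.y and split_idx['train'] (made-up values).
healthy_labels = [["a", "b"], ["c", "d", "e"]]
broken_labels = [["a", "c"]]  # length 1, as reported in this issue
train_idx = [0, 1]

check_labels_cover_indices(healthy_labels, train_idx)  # passes silently

try:
    check_labels_cover_indices(broken_labels, train_idx)
except ValueError as err:
    message = str(err)
```

With a check like this, the failure would point at the label list itself rather than at an IndexError inside the list comprehension.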

Did something change in the version of the dataset uploaded?

weihua916 commented 2 years ago

Hi! The underlying dataset file did not change, and I get the same error as you. This seems to be caused by a change in the collate policy of PyTorch Geometric (I am using torch_geometric 2.0.1).

data0.y = ['a', 'b']
data1.y = ['c', 'd', 'e']
data_list = [data0, data1]

# after collating and using the pyg dataset object, I got
dataset.data.y = [['a', 'c']]
dataset.data.slices['y'] = [tensor([0,1])]
dataset[0].y = ['a']
dataset[1].y = ['c']

# expected behavior should be
dataset.data.y = [['a', 'b'], ['c', 'd', 'e']]
dataset[0].y = ['a', 'b']
dataset[1].y = ['c', 'd', 'e']

@rusty1s Could you please help check this?

rusty1s commented 2 years ago

Good catch. PyG 2.0 also tries to collate elements of lists (similar to how the standard PyTorch DataLoader handles lists), which leads to this change in outcome. I restored the original behavior in PyG for lists which hold elements such as integers and strings. As a result, you can fix this issue by installing PyG from master for now.

shyam196 commented 2 years ago

Awesome, thanks both!

shyam196 commented 2 years ago

I might have spoken too soon 🤔

I tried installing PyG from master (i.e. pip install git+...) but I still get the same error on the code2 example. After installing from master, I checked in the interpreter and the change in behaviour mentioned by @weihua916 is fixed, so I think something else may be the source of the issue. The length of dataset.data.y is still 1 on PyG master :-(

rusty1s commented 2 years ago

OK, let me check that tomorrow :)

rusty1s commented 2 years ago

I checked once again and the above error is gone for me when using PyG master. Keep in mind that you need to re-process the dataset first. You can simply enforce this by removing the processed directory in the dataset folder.
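
The re-processing step can be sketched in Python as below. This is an illustrative sketch, not an official OGB or PyG utility: the helper name `remove_processed_dir` is hypothetical, and the folder name `ogbg_code2` is an assumption about where the dataset was downloaded. The demonstration runs on a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def remove_processed_dir(dataset_root) -> bool:
    """Delete `<dataset_root>/processed` if it exists; return True if removed.

    Hypothetical helper for illustration; the raw download is left untouched.
    """
    processed = Path(dataset_root) / "processed"
    if processed.is_dir():
        shutil.rmtree(processed)
        return True
    return False


# Demonstration on a throwaway directory that mimics a dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp) / "ogbg_code2"  # assumed folder name
    (fake_root / "processed").mkdir(parents=True)
    (fake_root / "raw").mkdir()
    removed = remove_processed_dir(fake_root)
    raw_still_there = (fake_root / "raw").is_dir()
    processed_gone = not (fake_root / "processed").exists()
```

In a real checkout, the helper would be pointed at the actual dataset folder before rerunning main_pyg.py, so that the dataset is re-processed with the installed PyG version. Check the path locally first, since the location depends on the root directory used for the download.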

shyam196 commented 2 years ago

That fixed it! Thanks 🙂