vijaydwivedi75 / lrgb

Long Range Graph Benchmark, NeurIPS 2022 Track on D&B
MIT License

Any URL to directly download these datasets? #5

Closed HB-pencil-zero closed 1 year ago

HB-pencil-zero commented 1 year ago

Or must we use this code to generate them?

HB-pencil-zero commented 1 year ago

Thanks for your code, but I can't find how to get the peptides dataset. Can you give me some instructions?

rampasek commented 1 year ago

Hi,

All the datasets are deposited at Zenodo for a direct download: https://zenodo.org/record/6975830

You can find the PyG dataset loaders in this repo or in the GraphGPS main repo: https://github.com/rampasek/GraphGPS/tree/main/graphgps/loader/dataset

These PyG loaders use Dropbox- or S3-hosted URLs, which download faster than the Zenodo servers.

I hope this answers your questions!

Best, Ladislav
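
For readers who prefer the direct-download route, here is a minimal sketch of a cached download helper using only the Python standard library. The function name `fetch` and the `.part` temporary-file convention are illustrative assumptions, not part of the LRGB or GraphGPS code. The thread does not list the individual file names in the Zenodo record, so the URL has to be copied from the record page.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless `dest` already exists (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the previously downloaded copy
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    # Stream to a temporary file so an interrupted download never
    # leaves a truncated file under the final name.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)  # rename only after the download completed
    return dest


# Usage (placeholders, fill in from the Zenodo record page):
# fetch("<file URL copied from the record page>", Path("data/<file name>"))
```

The PyG loaders mentioned above handle downloading themselves, so a helper like this is only needed when working with the raw files outside PyG.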

HB-pencil-zero commented 1 year ago

Thanks for your nice answer, but there are still a few things I don't understand about the pcqm4m-contact dataset. I found it to be of size (3378606, 1), which is very confusing to me, since according to the paper it should have 529,434 graphs. Could you please tell me how the pcqm4m-contact dataset is organized?

rampasek commented 1 year ago

The dataset is derived from the training set of OGB-LSC PCQM4Mv2. We computed the "contact" edges for all of those ~3.3M molecules, but due to computational limitations we use only a 530k subset, taking every 6th molecule from the original set. That is why the full file has 3.3M entries. You can find this in the loaders that I mentioned above, particularly this part: https://github.com/rampasek/GraphGPS/blob/6305368446275f9d5f3736d44bf265214d4f9a9b/graphgps/loader/dataset/pcqm4mv2_contact.py#L408

Let me know if you have any more questions!
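
To illustrate what "every 6th molecule" means, here is a toy sketch using plain Python slicing. The helper name `every_kth` and the 20-element list are assumptions for illustration only. The thread does not say whether the real loader starts at index 0 or filters further, so the linked line in `pcqm4mv2_contact.py` remains the reference.

```python
def every_kth(rows, k=6):
    """Keep rows 0, k, 2k, ... of a sequence (hypothetical helper)."""
    return rows[::k]


# Toy stand-in for the full list of molecules: 20 integer ids.
full = list(range(20))
subset = every_kth(full)
print(subset)  # [0, 6, 12, 18]
```

The full file keeps one entry per original molecule, and the subsampling happens only when the dataset is built, which is why the raw file is much larger than the graph count reported in the paper.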