Enhance Dataloader to handle small graph sampling

pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch

https://pyg.org

MIT License

21.37k stars 3.66k forks

Enhance Dataloader to handle small graph sampling #7946

Open jay-bhambhani opened 1 year ago

jay-bhambhani commented 1 year ago

🚀 The feature, motivation and pitch

We would like to enhance the dataloaders to specifically handle loading large volumes of small graphs. Currently, PyG is primarily geared toward large, highly connected graph data.

Alternatives

Currently, we can do this via the dataset, but a lot of our data will not fit into memory.

Additional context

No response

jay-bhambhani commented 1 year ago

happy to contribute to this one in any way possible!

rusty1s commented 1 year ago

Thanks for starting this issue. Relevant slack discussion: https://torchgeometricco.slack.com/archives/C01DN0B3B1N/p1693220997281019?thread_ts=1692902721.218399&cid=C01DN0B3B1N

I see people are using lmdb for this, see https://github.com/Open-Catalyst-Project/ocp/blob/main/ocpmodels/datasets/lmdb_database.py. It would be pretty cool to add such an option to PyG's datasets.
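As a rough illustration of the idea behind an LMDB-backed dataset, here is a minimal sketch that uses only the Python standard library rather than LMDB itself. It writes pickled graphs into one file and reads single graphs back through a memory map, so only the requested graph is deserialized. The names `write_graphs` and `MMapGraphReader`, and the dict layout of the toy graphs, are assumptions made for this example and are not part of PyG or the linked OCP code.

```python
import mmap
import os
import pickle
import tempfile


def write_graphs(path, graphs):
    """Write pickled graphs back to back; return an (offset, length) index."""
    index = []
    with open(path, "wb") as f:
        for graph in graphs:
            blob = pickle.dumps(graph)
            index.append((f.tell(), len(blob)))
            f.write(blob)
    return index


class MMapGraphReader:
    """Hypothetical reader: fetches one graph at a time from a memory-mapped file."""

    def __init__(self, path, index):
        self._file = open(path, "rb")
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._index = index

    def __len__(self):
        return len(self._index)

    def __getitem__(self, i):
        offset, length = self._index[i]
        # Only the bytes of the requested graph are copied and unpickled.
        return pickle.loads(self._mmap[offset:offset + length])

    def close(self):
        self._mmap.close()
        self._file.close()


# Toy graphs as plain dicts (assumed layout, not a PyG `Data` object).
graphs = [
    {"edge_index": [[0, 1], [1, 0]], "x": [[1.0], [2.0]]},
    {"edge_index": [[0, 1, 2], [1, 2, 0]], "x": [[0.5], [0.1], [0.9]]},
]
path = os.path.join(tempfile.mkdtemp(), "graphs.bin")
index = write_graphs(path, graphs)
reader = MMapGraphReader(path, index)
second = reader[1]
reader.close()
```

A real key-value store such as LMDB would also persist the index and handle concurrent readers, which this sketch leaves out.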

jay-bhambhani commented 1 year ago

Hi Matthias! Thank you so much for this. My team and I would love to take this on. However, we've been asked if we might be able to discuss this a bit more with you so we can scope it out and contribute. Would you potentially have some time next week to discuss? We are more than happy to work around your schedule, so just let us know!

A couple of questions off the bat. We love the idea of a memory-mapped file. Is there any interest in potentially adding a DB?

Do we assume that this will tie into the FeatureStore and GraphStore abstractions that already exist? In theory, I know that we could also store features in the same database if we are using something more like a generic key-value store or RDBMS.

Thanks for all of your support and guidance with this! We are extremely excited to contribute to this project!

rusty1s commented 1 year ago

Sure, we can discuss. What timezone are you in? I am in Europe.

For DB integration: I think I would implement this in a separate interface, and then do the integration in torch_geometric.data.Dataset. There is follow-up opportunity to actually implement a FeatureStore with it, but I wouldn't tie them necessarily together.
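One way to picture this separation is a minimal sketch in which the storage interface knows nothing about datasets and a thin dataset-like wrapper sits on top of it. Everything here is an assumption for illustration: the `Database`, `SQLiteDatabase`, and `OnDiskGraphDataset` names are hypothetical, the wrapper is a plain Python class rather than a subclass of `torch_geometric.data.Dataset`, and SQLite with pickled blobs stands in for whichever backend is eventually chosen.

```python
import pickle
import sqlite3
from abc import ABC, abstractmethod


class Database(ABC):
    """Hypothetical storage interface, independent of any dataset class."""

    @abstractmethod
    def insert(self, index, value): ...

    @abstractmethod
    def get(self, index): ...

    @abstractmethod
    def __len__(self): ...


class SQLiteDatabase(Database):
    """One possible backend: pickled graphs stored as BLOBs keyed by integer id."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS graphs (id INTEGER PRIMARY KEY, data BLOB)"
        )

    def insert(self, index, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO graphs (id, data) VALUES (?, ?)",
            (index, pickle.dumps(value)),
        )
        self._conn.commit()

    def get(self, index):
        row = self._conn.execute(
            "SELECT data FROM graphs WHERE id = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return pickle.loads(row[0])

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM graphs").fetchone()[0]


class OnDiskGraphDataset:
    """Dataset-like wrapper (hypothetical): graphs are fetched lazily per index."""

    def __init__(self, db):
        self.db = db

    def __len__(self):
        return len(self.db)

    def __getitem__(self, index):
        return self.db.get(index)


db = SQLiteDatabase(":memory:")
db.insert(0, {"edge_index": [[0, 1], [1, 0]], "num_nodes": 2})
db.insert(1, {"edge_index": [[0, 1, 2], [1, 2, 0]], "num_nodes": 3})
dataset = OnDiskGraphDataset(db)
sizes = [dataset[i]["num_nodes"] for i in range(len(dataset))]
```

Because the wrapper only depends on `insert`, `get`, and `__len__`, a different backend could be swapped in without touching the dataset side, and a feature store could later reuse the same interface without being tied to it.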

jay-bhambhani commented 1 year ago

We are in the US Eastern time zone - so I’m sure we can find a time that works for both of us!

Thanks for the suggestions! Looking forward to chatting soon!

rusty1s commented 1 year ago

Does 4PM CEST on Thursday work for you? You can send an invite to matthias@kumo.ai.