Interface for datasets that are too large to use `InMemoryDataset`

hatemhelal commented 2 years ago

🚀 The feature, motivation and pitch

There are several examples of datasets for molecular property prediction where each individual graph example easily fits in memory but there are too many examples to fit within the InMemoryDataset interface. One solution is to save each example in its own .pt file but this introduces a significant filesystem overhead to access each example.

A better solution is to partition the data such that there are many graphs serialised within a single .pt file. The number of graphs can be considered a chunk_size parameter which is independent from the training batch_size. This ChunkedDataset interface would be expected to scale to as large a dataset as desired while avoiding the significant overhead of having one graph per file.

The design idea is roughly:

ChunkedDataset inherits from the PyG Dataset interface
Accepts a chunk_size argument
Has an abstract method process_chunk that accepts a list of data objects that can be processed and saved as a single .pt file.

Other considerations:

The training batch size should not depend on the chunk_size so the dataset
ChunkedDataset should support splitting to read from parallel workers as well as random shuffling

Alternatives

No response

Additional context

No response

rusty1s commented 2 years ago

These are all pretty good thoughts. The ChunkedDataset API looks good to me. One additional consideration is that a ChunkedDataset likely requires modification to torch_geometric.data.DataLoader as well. We have two options here:

chunked_dataset[0] returns a batch of graphs/list of graphs, and DataLoader(chunked_dataset, batch_size=4) determines how many chunks we want to batch together. This requires additional logic during collate_fn.
chunked_dataset acts as a regular dataset where chunked_dataset[0] returns a single graph, but caches remaining data of that chunk in memory for faster access. This likely requires implementing our own sampler logic during data loading to guaranteee that consecutive examples are accessed (independent of shuffle=True)

Padarn commented 2 years ago

Interesting idea for a feature: Seems like something a lot of people could possibly use.

Just wanted to query a bit on

One solution is to save each example in its own .pt file but this introduces a significant filesystem overhead to access each example. Have you run any benchmarks on this? I don't know how big the graphs are usually, but I'm just curious how big of a difference the IO operations make to the overall processing rate.

Reason I ask is that chunking in files will make shuffle=True not actually shuffle randomly, and I'm wondering if there is a way we can retain this functionality. But maybe this is not important at all, just a thought :-)

rusty1s commented 2 years ago

Reason I ask is that chunking in files will make shuffle=True not actually shuffle randomly, and I'm wondering if there is a way we can retain this functionality. But maybe this is not important at all, just a thought :-)

Yes, with this approach, we would only be able to shuffle chunks of graphs together to form a mini-batch. An alternative solution might be to utilize some database system (e.g., RocksDB) and save and load your data from there.

Padarn commented 2 years ago

Could also consider trying to do something with formats like https://capnproto.org/ and mem maps. https://groups.google.com/g/capnproto/c/Q_9klBP6rrg

LiuHaolan commented 2 years ago

Hi, I'd like to build this chunkeddataset support if no one else is doing it. :-) My own project also needs that support.

Padarn commented 2 years ago

I don't think anyone is working on it :-)

rusty1s commented 2 years ago

@LiuHaolan Amazing, let me know if we can help in any way.

hatemhelal commented 2 years ago

@LiuHaolan that would be amazing if you wanted to get this going as it was going to take me a few weeks before I can work on this. Happy to help with any PR reviews

LiuHaolan commented 2 years ago

I have some questions regarding the original design, why do we need to implement the abstract method process_chunk? Can't we just take a similar approach like InMemoryDataset, asking users to load the chunk in self.data and self.slices (in that case those members should be a list) ?

rusty1s commented 2 years ago

I agree, we probably need to require a process_example method and do any logic of creating chunks internally. WDYT?

hatemhelal commented 2 years ago

There are probably many other ways to achieve this but I think I had in mind that the ChunkedDataset interface would be responsible for the chunking logic. The process_chunk method defers to the concrete client class the logic for reading the data from some external format . This is roughly the template method if you want an additional pointer for researching the pros/cons of this approach.

I have some doubts about this proposed feature. PyTorch has the IterableDataset for representing datasets that do not fit into memory. Has anyone looked at using this interface with PyG before?

rusty1s commented 2 years ago

Yes, ChunkedDataset should take care of chunking logic internally, but we still would need to have a process logic on a per graph/example level, right?

LiuHaolan commented 2 years ago

So basically my current thoughts are that users need to write the chunk logic in process() and store them in some data structures such as self.chunked_data and self.chunked_slices (both are lists right now, compared with a single data/slices in InMemoryDataset), so that the ChunkedDataset will be able to load them in the len() and get() method (can be on-demand or prefetching)?

I have implemented a hard-coded version of chunkeddataset (only apply to my dataset) and it worked. I will work to retrofit my code to a generic version.

I agree, we probably need to require a process_example method and do any logic of creating chunks internally. WDYT?

josiahbjorgaard commented 2 years ago

Is anyone working on this?

For my own purposes, I'm currently implementing logic to save processed data as chunks to disk and load on-the-fly to a dataset for training.

How should batch size on the dataset be handled to match with chunk on-the-fly loading? It would defeat the purpose if all chunks are loaded into memory at any given time, so batching needs to match the chunk indices to some extent - and loading chunks needs to write into the same memory space.

rusty1s commented 2 years ago

Yes, your batch size should be divisible by the chunk size. I am not sure anyone is working on this though.

LiuHaolan commented 2 years ago

Is anyone working on this?

For my own purposes, I'm currently implementing logic to save processed data as chunks to disk and load on-the-fly to a dataset for training.

How should batch size on the dataset be handled to match with chunk on-the-fly loading? It would defeat the purpose if all chunks are loaded into memory at any given time, so batching needs to match the chunk indices to some extent - and loading chunks needs to write into the same memory space.

Hi! Sorry I am a bit busy these days and don't have time to refractor the code. You can work on this if you want.

josiahbjorgaard commented 2 years ago

The best approach that I see for this is to add support for PyTorch's IterableDataset by creating an 'IterableDataset' version of PyG's Dataset class. Then, to support dataset chunk loading, provide examples for processing large graph datasets into shards and loading them on-the-fly via implementations of the added PyG IterableDataset class. As noted above, this inherits from the PyTorch solution to this issue.

Using IterableDataset solves some of the issues around matching batch size to chunk size, fetching index specified dataset samples from out of memory chunks, and the challenge of implementing reusable logic for chunk saving and loading, which will likely be dependent on the user's specific application.

Any thoughts or comments?

rusty1s commented 2 years ago

Interesting. Initially I thought about something simpler by just letting dataset[0] return a Batch of examples (rather than a single data object) - I think your solution is definitely more elegant but might be harder to implement?

hatemhelal commented 2 years ago

I think targeting a pyg version of IterableDataset makes a lot of sense. I really like the feature of applying a pre_transform as a static preprocessing that only needs to be evaluated once. Ideally it would be nice to have a way to keep that feature so the data flow looks like:

user handles reading a graph from some external format (would be nice if this could support parallel or vectorised implementation too)
pyg handles the sharding (what I call chunking in the original idea) and saves the Data objects after possibly applying the optional pre_transform (using .pt files or some other well defined format?)
when loading we might have parallel workers so each one is configured to deal with a subset of the shards of the dataset using get_worker_info()

I think it should be possible to hide the occasional latency that comes with loading a shard file into memory. There would be a tradeoff between memory used and how much you can parallelise in the data loader so it would be nice to allow users to configure the shard size.

Another thing I'm looking into is the newer torchdata classes which introduces the IterDataPipe class. One neat feature is the out-of-the-box support for reservoir sampling to shuffle the dataset (see shuffler)

The fact that the datapipe API is in beta gives me some pause but maybe this isn't too much to worry about?

rusty1s commented 2 years ago

@hatemhelal PyG has some good support for data pipes already, see https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py. Perhaps this might be useful to implement this.

josiahbjorgaard commented 2 years ago

Here's a first try creating an IterableDataset class for PyG - I've been using this in my own project, implementing the iter function for my dataset and training with it.

As mentioned above, the latency of the dataloader to load a shard is a downside, so I'm working on an asynchronous queue for shard loading in my projects implementation of the iter function. That same implementation does parallel workers as mentioned above.

The custom dataset which has an implemented iter function, inheriting iter from the proposed IterableDataset class, is where the most useful functionality is. Should that not go in an example doc?

ashkspark commented 1 year ago

Can we leverage Neo4j and query only what we need for PyG computation? How hard creating the pipeline looks? Do you have some materials you can share?

rusty1s commented 1 year ago

You might be able to connect your Neo4j graph via the remote backend interface PyG provides, see here.

pyg-team / pytorch_geometric