materialsvirtuallab / matgl

Graph deep learning library for materials

[Feature Request]: Larger than memory datasets. #210

Open JonathanSchmidt1 opened 9 months ago

JonathanSchmidt1 commented 9 months ago

Problem

Now that multi-GPU training is working, we are very interested in training on some larger crystal-structure datasets. However, these datasets do not fit into RAM. It would be great if it were possible to either load only a partial dataset on each DDP node or load the features on the fly, so that large-scale training becomes feasible. I assume the LMDB datasets that OCP and matsciml use should work for this. PS: thank you for the PyTorch Lightning implementation.

Proposed Solution

Add an option to save and load data to/from an LMDB database.

Alternatives

Examples can be found here https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb or here https://github.com/IntelLabs/matsciml/tree/main/matsciml/datasets
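For illustration, here is a minimal sketch of what an LMDB-backed graph dataset could look like. This is only an assumption of how it might be wired up, not an existing matgl API: the helper names and the path graphs.lmdb are hypothetical, and it assumes the pre-converted graphs and their labels can be pickled.

import pickle

import lmdb
from torch.utils.data import Dataset


def write_graphs_to_lmdb(graphs, labels, db_path="graphs.lmdb", map_size=2**40):
    """Serialize (graph, label) pairs into a single-file LMDB environment, one key per entry."""
    env = lmdb.open(db_path, map_size=map_size, subdir=False)
    with env.begin(write=True) as txn:
        for i, (graph, label) in enumerate(zip(graphs, labels)):
            txn.put(str(i).encode(), pickle.dumps((graph, label)))
        txn.put(b"__len__", pickle.dumps(len(graphs)))
    env.close()


class LMDBGraphDataset(Dataset):
    """Reads one graph at a time from LMDB, so the full dataset never has to sit in RAM."""

    def __init__(self, db_path="graphs.lmdb"):
        # In practice the environment may need to be (re)opened lazily in each DataLoader
        # worker; opening it eagerly here keeps the sketch short.
        self.env = lmdb.open(db_path, readonly=True, lock=False, subdir=False)
        with self.env.begin() as txn:
            self.length = pickle.loads(txn.get(b"__len__"))

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            graph, label = pickle.loads(txn.get(str(idx).encode()))
        return graph, label

The point is that each __getitem__ call deserializes only one (graph, label) pair, so memory usage is bounded by the batch size rather than the dataset size.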


JonathanSchmidt1 commented 8 months ago

A small update on this request: I also asked the matsciml team about this issue, since they include an interface to matgl and other models in their package, and they were kind enough to prepare a guide on how to build a suitable dataset: https://github.com/IntelLabs/matsciml/issues/85. I will follow that guide for our data, and perhaps it could also be used to extend the training capabilities of matgl.

shyuep commented 8 months ago

@JonathanSchmidt1 The dataloaders in matgl already allow you to do a one-time processing of the structures into a graph dataset. Once that graph dataset is built, it is much smaller in memory than the structures. In fact, that is how we have been training on extremely large datasets.

JonathanSchmidt1 commented 7 months ago

Thank you for the reply. The preprocessing is definitely useful. But after preprocessing, a decently sized dataset (4.5M structures) takes up 132 GB on disk and 128 GB when loaded into RAM, and we would like to train on even larger datasets in the future. Maybe I am also doing something wrong. Right now I am just doing the following to preprocess the data:

# assumed imports, per the matgl version used here
from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import M3GNetDataset

# collect the element types present in the structures
elem_list = get_element_list(structures)
# set up a graph converter
converter = Structure2Graph(element_types=elem_list, cutoff=6.0)
# convert the raw structures into an M3GNetDataset
dataset = M3GNetDataset(
    threebody_cutoff=4.0, structures=structures, converter=converter, labels={"energies": energies}
)
dataset.process()
dataset.save()

For me, the issue is also the rather old architecture of the GPU partition of the supercomputer I have to use (64 GB RAM per node, one GPU per node), so in-memory datasets are simply not an option there. But even on a modern node with, say, 512 GB of RAM and 8 GPUs, the in-memory dataset becomes a problem: as I understand it, with DDP each process loads its own copy of the dataset, which would amount to roughly 960 GB if I used all GPUs?
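One workaround I could imagine (purely a sketch under my own assumptions, not something matgl provides as far as I know) is to split the pre-processed graphs into one shard per rank and have each process load only its own shard, roughly like this:

import os
import pickle

import torch.distributed as dist


def save_shards(graphs, labels, out_dir, world_size):
    """Write one pickle file per rank; shard r gets every world_size-th (graph, label) pair."""
    os.makedirs(out_dir, exist_ok=True)
    for rank in range(world_size):
        shard = [(g, y) for i, (g, y) in enumerate(zip(graphs, labels)) if i % world_size == rank]
        with open(os.path.join(out_dir, f"shard_{rank}.pkl"), "wb") as f:
            pickle.dump(shard, f)


def load_local_shard(out_dir):
    """Inside a DDP process, load only the shard belonging to the current rank."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    with open(os.path.join(out_dir, f"shard_{rank}.pkl"), "rb") as f:
        return pickle.load(f)

With per-rank shards like this, the DistributedSampler step would be skipped, since the data is already partitioned across processes.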

JonathanSchmidt1 commented 3 weeks ago

Are there any updates on this? Even though we now have decent nodes with 200 GB of RAM per GPU, the datasets have also grown to more than 600 GB.