usnistgov / alignn

Atomistic Line Graph Neural Network
https://scholar.google.com/citations?user=9Q-tNnwAAAAJ&hl=en
https://www.youtube.com/watch?v=WYePjZMzx3M
https://jarvis.nist.gov/jalignn/

Running ALIGNN on Multi-GPUs #90

Open aydinmirac opened 1 year ago

aydinmirac commented 1 year ago

Dear All,

I would like to run ALIGNN on multiple GPUs, but when I checked the code I could not find an option for this.

Is there a way to run ALIGNN on multiple GPUs, for example with PyTorch Lightning or PyTorch's DistributedDataParallel (DDP)?

Best regards, Mirac

bdecost commented 1 year ago

There is a config option to wrap the model in DistributedDataParallel, but I think that may be a dead code path at this point. https://github.com/usnistgov/alignn/blob/736ec739dfd697d64b1c2a01dc84678a24bcfacd/alignn/train.py#L651

I think the most straightforward path is through ignite.distributed. Lightning would probably also be fairly easy to use if you wrap our models and dataloaders (I tried using Lightning once, but it's been a while).

I have some experimental code for training on multiple GPUs that I think hasn't gotten merged into the main repo. It more or less followed the ignite.distributed CIFAR example. I don't think there was much required beyond wrapping the model, optimizer, and dataloader in ignite's auto distributed helpers and fixing the training entry point to have the right call signature.
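For reference, a minimal sketch of that pattern (following the ignite.distributed auto helpers rather than the actual experimental code; the toy model and dataset are stand-ins for ALIGNN's):

import torch
import ignite.distributed as idist

def training(local_rank, config):
    # idist.auto_* wrap the model/optimizer/dataloader for whichever backend
    # idist.Parallel was launched with (DDP over nccl here).
    model = idist.auto_model(torch.nn.Linear(16, 1))  # stand-in for the ALIGNN model
    optimizer = idist.auto_optim(torch.optim.AdamW(model.parameters()))
    dataset = torch.utils.data.TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"])
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        x, y = x.to(idist.device()), y.to(idist.device())
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 32})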

Cleaning that up is actually on my near-term to-do list. Is there a particular form or use case for multi-GPU training/inference that you think would be useful?

aydinmirac commented 1 year ago

Hi @bdecost,

Thank you so much for your reply.

Let me try your solution. I will inform you as soon as possible.

I am using 2x RTX A4500 GPUs and distributing data with the following PyTorch Lightning snippet. It is very simple and useful:

import pytorch_lightning as pl

# callbacks, logger, root_dir, task (a LightningModule), and data
# (a LightningDataModule) are defined elsewhere in my training script.
trainer = pl.Trainer(
    callbacks=callbacks,
    logger=logger,
    default_root_dir=root_dir,
    max_epochs=200,
    devices=2,
    accelerator="gpu",
    strategy="ddp",  # one process per GPU, batches split across them
)
trainer.fit(task, datamodule=data)

My model does not fit into a single GPU because of large molecular structures. These use cases are pretty common when studying big structures, so I think a feature that makes it easy to split a model across GPUs would be very helpful when using ALIGNN.

bdecost commented 1 year ago

> My model does not fit into a single GPU because of large molecular structures. These use cases are pretty common when studying big structures, so I think a feature that makes it easy to split a model across GPUs would be very helpful when using ALIGNN.

Are you looking to shard your model across multiple GPUs? In that case, Lightning looks like it might be able to set that up automatically. I've only tried data parallelism with ignite to get better throughput; I'm not sure ignite supports model parallelism.
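For reference, a rough sketch of what sharded training might look like in Lightning (an assumption on my part, not something we ship: the exact strategy string differs between Lightning versions, and task / data are the LightningModule and DataModule from your earlier snippet):

import pytorch_lightning as pl

# "fsdp" shards parameters, gradients, and optimizer state across the two
# GPUs instead of replicating the full model on each one.
trainer = pl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy="fsdp",
    max_epochs=200,
)
trainer.fit(task, datamodule=data)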

What kind of batch sizes are you working with? I'm wondering whether you could reduce your batch size and model size to fit within your GPU memory limit, and maybe use data parallelism to get higher throughput / a larger effective batch size. The layer configuration study we did indicates that you can drop to a configuration with 2 ALIGNN and 2 GCN layers without really sacrificing performance, and this substantially reduces the model size (not quite by a factor of 2). See also the cost vs. accuracy summary of that experiment. I'm not sure this tradeoff would hold for larger molecular structures, since the performance tradeoff study was done on the JARVIS dft_3d dataset, but it might be worth trying.
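A minimal sketch of such a reduced configuration (assuming the ALIGNNConfig fields keep their usual names; check alignn/models/alignn.py for the exact defaults):

from alignn.models.alignn import ALIGNN, ALIGNNConfig

# Smaller model: 2 ALIGNN layers + 2 GCN layers instead of the 4 + 4 default.
config = ALIGNNConfig(
    name="alignn",
    alignn_layers=2,
    gcn_layers=2,
    output_features=1,  # single regression target
)
model = ALIGNN(config)
print(sum(p.numel() for p in model.parameters()), "parameters")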

aydinmirac commented 1 year ago

Hi @bdecost,

Let me elaborate on my case.

I am using the OMDB dataset, which contains large molecules with 82 atoms per molecule on average. That is considerably bigger than the QM9 dataset. As models, I am using SchNetPack and ALIGNN to train on this dataset.

I trained SchNet and ALIGNN on an A100 (40 GB GPU memory) before. The batch sizes and GPU memory usage on the A100 are below:

SchNet --> batch size = 64, memory usage = 39 GB
ALIGNN --> batch size = 48, memory usage = 32 GB

If I increase the batch size, both models crash on the A100. Now I have to train these models on 2x RTX A4500 GPUs (20 GB memory each) and split them across the two GPUs without running into out-of-memory problems. SchNet uses Lightning to handle this.

I just wondered if I can do the same with ALIGNN. Otherwise, as you mentioned, I will have to reduce the batch size. I will also try the suggestions you mentioned above.

bdecost commented 1 year ago

Ok, cool. I wasn't sure if you had such large molecules that fitting a single instance in GPU memory was a problem.

What you want is more straightforward to do than model parallelism, I think. Our current training setup doesn't make it as simple as a configuration setting, but I think it would be nice for us to support it.

JonathanSchmidt1 commented 10 months ago

We are also interested in training on some larger datasets. What is the current state of distributed training?

knc6 commented 10 months ago

Hi,

Did you try distributed: true, which uses the accelerate package to manage multi-GPU training?

JonathanSchmidt1 commented 10 months ago

With distributed: true, the training hangs at "building line graphs", even when only using one node. However, there seem to be quite a lot of issues with hanging processes in accelerate, and unfortunately the kernel version of the supercomputing system I am on is 5.3.18, for which this seems to be a known issue (https://huggingface.co/docs/accelerate/basic_tutorials/troubleshooting).

Did you have any reports of the training hanging at this stage?

JonathanSchmidt1 commented 6 months ago

@knc6 Just a quick check-in concerning DataParallel, since the distributed option was removed. I am getting some device errors with DataParallel, and I am also not sure whether DataParallel can properly split the batches for DGL graphs:

dgl._ffi.base.DGLError: Cannot assign node feature "e_src" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

Is DataParallel training working for you? Thank you so much!
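(For context, a rough illustration of the device issue: torch.nn.DataParallel scatters plain tensors across replicas but has no equivalent step for DGL graph objects, so a module replica on cuda:1 can receive a graph whose storage is still on cuda:0; DGL generally recommends one process per GPU, i.e. DDP, instead.)

import torch
import dgl

# Toy graph standing in for a batched ALIGNN input graph.
g = dgl.rand_graph(10, 20)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# DGLGraph.to() moves the graph structure and all node/edge features together,
# which is what the error message is asking for.
g = g.to(device)
print(g.device)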

knc6 commented 6 months ago

There is a WIP DistributedDataParallel implementation of ALIGNN in the ddp branch. Currently it prints outputs multiple times (once per process), which should hopefully be an easy fix. I also need to check reproducibility and a few other aspects of DDP.
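A common fix for the duplicated output, sketched generically here rather than taken from the ddp branch, is to route printing through a rank check so only the main process logs:

import torch.distributed as dist

def is_main_process() -> bool:
    # True for single-process runs, or for rank 0 when torch.distributed is initialized.
    return not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0

def log(*args, **kwargs):
    # Each message then appears once instead of once per process.
    if is_main_process():
        print(*args, **kwargs)

log("training started")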

JonathanSchmidt1 commented 6 months ago

Great, will try it out.

JonathanSchmidt1 commented 6 months ago

Thanks for sharing the branch. I tested it with cached datasets and 2 GPUs, and the results were reproducible and consistent with what I would expect from 1 GPU.

However, I am already noticing that RAM will become an issue at some point due to the in-memory datasets (2 tasks with 400k materials already take 66 GB; 8 tasks with 4M materials would be ~3.2 TB).

knc6 commented 6 months ago

@JonathanSchmidt1 Thanks for checking. I may have a possible solution to lessen the memory footprint using an LMDB dataset; I will get back to you soon.
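A generic sketch of the LMDB idea (not the actual ALIGNN implementation; the key layout and pickle serialization are assumptions): graphs are written once to an on-disk LMDB file and deserialized lazily per item, so the full dataset never has to sit in RAM.

import pickle
import lmdb
from torch.utils.data import Dataset

class LMDBGraphDataset(Dataset):
    """Lazily loads pickled (graph, line_graph, label) records from an LMDB file."""

    def __init__(self, lmdb_path: str):
        # readonly + lock=False keeps the environment safe to share across
        # DataLoader worker processes.
        self.env = lmdb.open(lmdb_path, subdir=False, readonly=True, lock=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            return pickle.loads(txn.get(str(idx).encode()))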

JonathanSchmidt1 commented 6 months ago

That's a good idea; LMDB datasets definitely work for this. If you would like to use LMDB datasets, there are a few examples in e.g. https://github.com/IntelLabs/matsciml/tree/main for both DGL and PyG. I have also implemented one of them, but unfortunately I just don't have the capacity right now to promise anything. However, taking one of their classes that parses pymatgen structures, e.g. the Materials Project dataset or the Alexandria dataset, and switching to your line graph conversion might be easy. E.g., for the M3GNet preprocessing they just use:

# Excerpted from matsciml; registry, MaterialsProjectDataset, M3GNetDataset,
# Structure2Graph, and element_types are imported from matsciml/matgl there
# (imports omitted in this snippet).
@registry.register_dataset("M3GMaterialsProjectDataset")
class M3GMaterialsProjectDataset(MaterialsProjectDataset):
    def __init__(
        self,
        lmdb_root_path: str | Path,
        threebody_cutoff: float = 4.0,
        cutoff_dist: float = 20.0,
        graph_labels: list[int | float] | None = None,
        transforms: list[Callable[..., Any]] | None = None,
    ):
        super().__init__(lmdb_root_path, transforms)
        self.threebody_cutoff = threebody_cutoff
        self.graph_labels = graph_labels
        self.cutoff_dist = cutoff_dist
        self.clear_processed = True

    def _parse_structure(
        self,
        data: dict[str, Any],
        return_dict: dict[str, Any],
    ) -> None:
        super()._parse_structure(data, return_dict)
        structure: None | Structure = data.get("structure", None)
        self.structures = [structure]
        self.converter = Structure2Graph(
            element_types=element_types(),
            cutoff=self.cutoff_dist,
        )
        # Depending on the matgl version, process() returns either
        # (graphs, lg, sa) or (graphs, lattices, lg, sa).
        graphs, lattices, lg, sa = M3GNetDataset.process(self)
        return_dict["graph"] = graphs[0]
knc6 commented 6 months ago

@JonathanSchmidt1 I have implemented a basic LMDB dataset here, feel free to give it a try.

JonathanSchmidt1 commented 6 months ago

Thank you very much. Will give it a try this week.