aydinmirac opened this issue 1 year ago
There is a config option to wrap the model in DistributedDataParallel, but I think that may be a dead code path at this point. https://github.com/usnistgov/alignn/blob/736ec739dfd697d64b1c2a01dc84678a24bcfacd/alignn/train.py#L651
I think the most straightforward path is through ignite.distributed. Lightning would probably be pretty straightforward to use as well if you wrap our models and dataloaders (I tried using lightning once but it's been a while).
I have some experimental code for training on multiple GPUs that I think hasn't gotten merged into the main repo. It more or less follows the ignite.distributed CIFAR example. I don't think much was required beyond wrapping the model, optimizer, and dataloader in ignite's auto distributed helpers and fixing the training entry point to have the right call signature.
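For reference, a minimal sketch of that pattern (toy model and data, not ALIGNN's actual entry point):

```python
# Minimal sketch of the ignite.distributed pattern described above (toy model/data,
# not ALIGNN code). The auto_* helpers wrap things in DistributedSampler /
# DistributedDataParallel as appropriate for the chosen backend.
import torch
from torch.utils.data import TensorDataset
import ignite.distributed as idist


def training(local_rank, config):
    dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    train_loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)
    model = idist.auto_model(torch.nn.Linear(8, 1))
    optimizer = idist.auto_optim(torch.optim.AdamW(model.parameters(), lr=1e-3))

    for _ in range(config["epochs"]):
        for x, y in train_loader:
            x, y = x.to(idist.device()), y.to(idist.device())
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    # spawns one process per GPU and sets up the process group
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 48, "epochs": 2})
```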
Cleaning that up is actually on my near-term to-do list. Is there a particular form or use case for multi-GPU training/inference that you think would be useful?
Hi @bdecost,
Thank you so much for your reply.
Let me try your solution. I will inform you as soon as possible.
I am using 2x RTX A4500 GPUs and distributing data with the following PyTorch Lightning code snippet. It is very simple and useful:
```python
trainer = pl.Trainer(
    callbacks=callbacks,
    logger=logger,
    default_root_dir=root_dir,
    max_epochs=200,
    devices=2,
    accelerator="gpu",
    strategy="ddp",
)
trainer.fit(task, datamodule=data)
```
My model does not fit into a single GPU because of large molecule structures. Such use cases are pretty common when studying big structures. I think adding a feature like that (I mean, being able to split a model across GPUs easily) would be very helpful when using ALIGNN.
Are you looking to shard your model across multiple GPUs? In that case, Lightning looks like it might be able to set that up automatically. I've only tried data parallelism with ignite to get better throughput; I'm not sure ignite supports model parallelism.
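For what it's worth, I think the Lightning route to sharding would look roughly like the sketch below; I haven't tried this with ALIGNN, and the strategy string is from recent Lightning releases, so treat it as a guess. `task` and `data` refer to the module/datamodule from your snippet above.

```python
# Hedged sketch: sharding parameters/gradients/optimizer state across the 2 GPUs with
# Lightning's FSDP strategy. `task` and `data` are the LightningModule / DataModule
# from the snippet above; not tested with ALIGNN.
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy="fsdp",   # or a DeepSpeed strategy, depending on the Lightning version
    max_epochs=200,
)
trainer.fit(task, datamodule=data)
```

That said, for GNNs the memory is often dominated by activations rather than parameters, so sharding the model alone may not buy much.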
What kind of batch sizes are you working with? I'm wondering if you could reduce your batch size and model size to fit within your GPU memory limit, and maybe use data parallelism to get higher throughput / effective batch size. The layer configuration study we did indicates that you can drop to a configuration with 2 ALIGNN and 2 GCN layers without really sacrificing performance, and this substantially reduces the model size (not quite by a factor of 2). See also the cost vs. accuracy summary of that experiment. I'm not sure this tradeoff would hold for larger molecular structures, since the study was done on the JARVIS dft_3d dataset, but it might be worth trying.
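For reference, a sketch of selecting that smaller configuration, assuming the current ALIGNNConfig field names (`alignn_layers` / `gcn_layers`); the same fields should also be settable in the model section of the training config JSON:

```python
# Sketch of the smaller 2 ALIGNN + 2 GCN layer configuration mentioned above,
# assuming the ALIGNNConfig field names (defaults are 4 and 4).
from alignn.models.alignn import ALIGNN, ALIGNNConfig

model = ALIGNN(ALIGNNConfig(name="alignn", alignn_layers=2, gcn_layers=2))
```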
Hi @bdecost,
Let me elaborate on my case.
I am using the OMDB dataset, which contains large molecules with 82 atoms per molecule on average, considerably larger than QM9. As models, I am using SchNetPack and ALIGNN to train on this dataset.
I trained SchNet and ALIGNN on an A100 (40 GB GPU memory) before. Batch size and GPU memory usage on the A100 were:
SchNet --> batch size = 64, memory usage = 39 GB
ALIGNN --> batch size = 48, memory usage = 32 GB
If I increase the batch size, both models crash on the A100. Now I have to train these models on 2x RTX A4500 (20 GB memory each) and split them across the 2 GPUs without running out of memory. SchNet uses Lightning to handle this.
I just wondered if I can do the same with ALIGNN. Otherwise, as you mentioned, I will have to reduce the batch size. I will also try the suggestions you made above.
Ok, cool. I wasn't sure if you had such large molecules that fitting a single instance in GPU memory was a problem.
What you want is more straightforward to do than model parallelism, I think. Our current training setup doesn't make it as simple as a configuration setting, but I think it would be nice for us to support it.
We are also interested in training on some larger datasets. What is the current state of distributed training?
Hi,
did you try `distributed: true`, which uses the accelerate package to manage multi-GPU training?
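For context, accelerate's usual pattern looks roughly like the generic sketch below (toy model and data, not the actual ALIGNN code path); it is launched with `accelerate launch --num_processes 2 script.py`.

```python
# Generic sketch of the accelerate pattern that `distributed: true` delegates to;
# the tiny model/dataset here are placeholders, not ALIGNN internals.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

accelerator = Accelerator()
# prepare() wraps the model in DDP and shards the dataloader per process.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```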
With `distributed: true` the training hangs at "building line graphs", even when using only one node. There seem to be quite a lot of reports of hanging processes in accelerate, and unfortunately the kernel version of the supercomputing system I am on is 5.3.18, for which this seems to be a known issue: https://huggingface.co/docs/accelerate/basic_tutorials/troubleshooting.
Did you have any reports of the training hanging at this stage?
@knc6 just a quick check-in concerning data parallel, since the distributed option was removed.
I am getting some device errors with DataParallel, and I am also not sure whether DataParallel can properly split the batches for DGL graphs:
```
dgl._ffi.base.DGLError: Cannot assign node feature "e_src" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
```
Is DataParallel training working for you? Thank you so much!
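For reference, my understanding is that nn.DataParallel only scatters plain tensors, so the batched DGL graph stays on cuda:0 while model replicas run on the other devices, which would explain the error above. The pattern the DGL docs recommend instead is one process per GPU with DistributedDataParallel, roughly like this toy sketch (not ALIGNN code):

```python
# Toy sketch of the one-process-per-GPU DDP pattern for DGL graphs: each rank gets its
# own shard of the dataset and moves the whole batched graph to its own device.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import dgl
from dgl.dataloading import GraphDataLoader
from dgl.nn import GraphConv


class ToyGNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = GraphConv(16, 1, allow_zero_in_degree=True)

    def forward(self, g):
        g.ndata["h"] = self.conv(g, g.ndata["feat"])
        return dgl.mean_nodes(g, "h")


def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}")

    # toy dataset of small random graphs with node features
    graphs = [dgl.rand_graph(10, 30) for _ in range(64)]
    for g in graphs:
        g.ndata["feat"] = torch.randn(g.num_nodes(), 16)
    labels = torch.randn(64, 1)

    # use_ddp=True gives each rank a disjoint shard of the dataset
    loader = GraphDataLoader(list(zip(graphs, labels)), batch_size=8, use_ddp=True)

    model = DDP(ToyGNN().to(device), device_ids=[rank])
    opt = torch.optim.Adam(model.parameters())
    for bg, y in loader:
        bg, y = bg.to(device), y.to(device)  # keep the graph and labels on this rank's GPU
        loss = torch.nn.functional.mse_loss(model(bg), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```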
There is a WIP DistributedDataParallel implementation of ALIGNN in the ddp branch. Currently it prints outputs multiple times, which hopefully should be an easy fix. I also need to check reproducibility and a few other aspects of DDP.
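(If it helps, the duplicated output is presumably just every rank printing; the usual fix is a rank-0 guard, something like:)

```python
# Common pattern for duplicated output under DDP: only rank 0 logs/prints.
import torch.distributed as dist


def is_main_process() -> bool:
    return (not dist.is_available()) or (not dist.is_initialized()) or dist.get_rank() == 0


if is_main_process():
    print("epoch 1: train_loss=...")  # placeholder message
```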
Great, will try it out.
Thanks for sharing the branch. I tested it with cached datasets and 2 GPUs, and it was reproducible and consistent with what I would expect from 1 GPU.
However, I am already noticing that RAM will become an issue at some point (2 tasks with 400k materials already take 66 GB; 8 tasks with 4M materials would be ~3.2 TB) due to the in-memory datasets.
@JonathanSchmidt1 Thanks for checking. I have a possible solution to lessen the memory footprint using an LMDB dataset; will get back to you soon.
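The rough shape would be something like the sketch below; the key layout and record contents are purely illustrative, not the actual implementation:

```python
# Minimal sketch of the idea: serialize each (graph, line_graph, label) record into LMDB
# once, then read records lazily in __getitem__ so nothing large stays in RAM.
import pickle
import lmdb
from torch.utils.data import Dataset


class LMDBGraphDataset(Dataset):
    def __init__(self, lmdb_path: str):
        # readonly + lock=False so multiple DDP ranks/workers can read concurrently
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            record = txn.get(f"{idx}".encode())
        return pickle.loads(record)  # e.g. (dgl_graph, line_graph, target)
```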
That's a good idea; LMDB datasets definitely work for this. If you would like to use LMDB datasets, there are a few examples of how to build them in e.g. https://github.com/IntelLabs/matsciml/tree/main, for both DGL and PyG. I have also implemented one of them, but unfortunately I just don't have the capacity right now to promise anything. However, taking one of their classes that parses pymatgen structures (e.g. the Materials Project dataset or the Alexandria dataset) and switching to your line-graph conversion might be easy. E.g. for M3GNet preprocessing they just use:
```python
@registry.register_dataset("M3GMaterialsProjectDataset")
class M3GMaterialsProjectDataset(MaterialsProjectDataset):
    def __init__(
        self,
        lmdb_root_path: str | Path,
        threebody_cutoff: float = 4.0,
        cutoff_dist: float = 20.0,
        graph_labels: list[int | float] | None = None,
        transforms: list[Callable[..., Any]] | None = None,
    ):
        super().__init__(lmdb_root_path, transforms)
        self.threebody_cutoff = threebody_cutoff
        self.graph_labels = graph_labels
        self.cutoff_dist = cutoff_dist
        self.clear_processed = True

    def _parse_structure(
        self,
        data: dict[str, Any],
        return_dict: dict[str, Any],
    ) -> None:
        super()._parse_structure(data, return_dict)
        structure: None | Structure = data.get("structure", None)
        self.structures = [structure]
        self.converter = Structure2Graph(
            element_types=element_types(),
            cutoff=self.cutoff_dist,
        )
        graphs, lattices, lg, sa = M3GNetDataset.process(self)
        # (depending on the matgl version, process() may instead return graphs, lg, sa)
        return_dict["graph"] = graphs[0]
```
@JonathanSchmidt1 I have implemented a basic LMDB dataset here; feel free to give it a try.
Thank you very much. Will give it a try this week.
Dear All,
I would like to run ALIGNN on multiple GPUs. When I checked the code, I could not find any option for this.
Is there any way to run ALIGNN on multiple GPUs, for example with PyTorch Lightning or PyTorch's DistributedDataParallel (DDP)?
Best regards, Mirac