zz-cloud opened this issue 3 years ago
We provide three options for multi-GPU training:
- DataParallel, which requires you to swap out DataLoader for DataListLoader (a short sketch follows this list).
- DistributedDataParallel, which follows PyTorch's design principles of distributed training (this one is actually preferred over DataParallel, as it is faster and works in a single-machine/multi-GPU setting as well).
- PyTorch Lightning: probably the smoothest experience if you aim to minimize code changes going from single-GPU to multi-GPU training.
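For the first option, a minimal sketch of DataParallel together with DataListLoader, assuming a graph-classification dataset such as TUDataset and a hypothetical two-layer model; the dataset, model, and hyperparameters are placeholders to adapt to your own project (in PyG 1.x the loader lives in torch_geometric.data, in PyG >= 2.0 in torch_geometric.loader):

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataListLoader  # torch_geometric.loader.DataListLoader in PyG >= 2.0
from torch_geometric.nn import DataParallel, GCNConv, global_mean_pool

class Net(torch.nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.conv = GCNConv(in_channels, 64)
        self.lin = torch.nn.Linear(64, num_classes)

    def forward(self, data):
        x = F.relu(self.conv(data.x, data.edge_index))
        x = global_mean_pool(x, data.batch)  # one embedding per graph
        return self.lin(x)

dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG')
loader = DataListLoader(dataset, batch_size=32, shuffle=True)  # yields Python lists of Data objects

model = DataParallel(Net(dataset.num_features, dataset.num_classes)).to('cuda')
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for data_list in loader:
    optimizer.zero_grad()
    out = model(data_list)  # DataParallel scatters the list across all visible GPUs
    y = torch.cat([data.y for data in data_list]).to(out.device)
    loss = F.cross_entropy(out, y)
    loss.backward()
    optimizer.step()
```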
Thank you for your response! To go from multi-GPU to single-GPU training, is it enough to just remove the DataParallel/DistributedDataParallel wrapper from the code? Are any other code changes needed? Thanks!
To go from multi-GPU to single-GPU training, you can just execute your multi-GPU training script on a single GPU.
When using DataListLoader (which subclasses DataLoader), I want to set the DataLoader parameter pin_memory=True to speed up transferring data to the GPU. How can I set pin_memory=True? Thank you!
import torch

def identity_collate(data_list):
    # Keep the sampled examples as a plain Python list instead of collating
    # them into a single batch; torch_geometric.nn.DataParallel scatters this
    # list across the available GPUs.
    return data_list

class DataListLoader(torch.utils.data.DataLoader):
    r"""Data loader which merges data objects from a
    :class:`torch_geometric.data.Dataset` into a Python list.

    .. note::
        This data loader should be used for multi-GPU support via
        :class:`torch_geometric.nn.DataParallel`.

    Args:
        dataset (Dataset): The dataset from which to load the data.
        batch_size (int, optional): How many samples per batch to load.
            (default: :obj:`1`)
        shuffle (bool, optional): If set to :obj:`True`, the data will be
            reshuffled at every epoch. (default: :obj:`False`)
    """
    def __init__(self, dataset, batch_size=1, shuffle=False, **kwargs):
        super(DataListLoader, self).__init__(
            dataset, batch_size, shuffle,
            collate_fn=identity_collate, **kwargs)
pin_memory is a parameter of torch.utils.data.DataLoader. How do I set it to True here? How does DataListLoader forward the pin_memory parameter to torch.utils.data.DataLoader?
You can pass the pin_memory=True option to DataListLoader, which should work just fine for PyG>=1.7.0.
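For illustration, a minimal sketch of what that looks like; the dataset choice and the extra num_workers value are placeholders, and any keyword argument understood by torch.utils.data.DataLoader can be forwarded in the same way:

```python
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataListLoader  # torch_geometric.loader in PyG >= 2.0

dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG')

# pin_memory (and any other torch.utils.data.DataLoader option, e.g. num_workers)
# is simply forwarded to the base class through **kwargs:
loader = DataListLoader(dataset, batch_size=32, shuffle=True,
                        pin_memory=True, num_workers=4)
```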
Do I need to use a special function for this, or does the existing source code already support it? Thank you.
I'm not sure what you mean. Can you clarify?
Does the DataListLoader in PyG 1.7.0 have a special interface for passing the pin_memory parameter to torch.utils.data.DataLoader, or does the DataListLoader source code in PyG 1.7.0 already accept a pin_memory parameter? Thank you.
This argument gets passed to the DataLoader via **kwargs, which means you can pass any argument to DataListLoader that the PyTorch DataLoader understands. Note that in PyG 1.7.0 we added pin_memory support to the data object, so that PyTorch can automatically pin its memory in case of pin_memory=True.
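To make the mechanism concrete: when pin_memory=True, PyTorch's DataLoader pins custom batch objects by calling a pin_memory() method on them if one exists. The class below is a simplified illustration of that protocol, not PyG's actual Data implementation:

```python
import torch

class MyGraph:
    """Toy stand-in for a data object that cooperates with pin_memory=True."""

    def __init__(self, x, edge_index):
        self.x = x
        self.edge_index = edge_index

    def pin_memory(self):
        # Called by torch.utils.data.DataLoader when pin_memory=True:
        # pinning host memory allows asynchronous (non_blocking) copies to the GPU.
        self.x = self.x.pin_memory()
        self.edge_index = self.edge_index.pin_memory()
        return self
```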
thank you
Is there any plan to introduce the ddp strategy for Lightning instead of ddp_spawn? ddp_spawn is strongly discouraged in the Lightning docs.
ddp cannot share data across processes. We will need to wait for https://github.com/pytorch/pytorch/issues/64932 in order to support the ddp strategy instead of ddp_spawn. I don't think the usage of ddp_spawn is bad though, as PyTorch itself uses it for all its DDP examples.
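For reference, a hedged sketch of how the two flavours are selected in PyTorch Lightning; the exact argument names differ between Lightning versions (older releases used accelerator='ddp_spawn' instead of strategy='ddp_spawn'), so adapt to your installed version:

```python
import pytorch_lightning as pl

# ddp_spawn: worker processes are spawned from the running script via
# torch.multiprocessing.spawn, so objects created before trainer.fit()
# exist in the parent process.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")

# ddp: each worker re-executes the whole training script in its own process,
# so a dataset held entirely in memory (e.g. an InMemoryDataset) is loaded
# once per process rather than shared.
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
```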
Hi @rusty1s , is the reason for using ddp_spawn related to the InMemoryDataset classes and the shared-memory behaviour explained in https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#ddp-spawn-shared-memory ?
If I create a torch.utils.data.Dataset (or slightly rewrite torch_geometric.data.Dataset) and access the data within __getitem__ by loading from an LMDB database, as done here: https://github.com/drorlab/atom3d/blob/master/atom3d/datasets/datasets.py#L34-L118 - can I use DDP from PyTorch Lightning? I tried it and the code runs, but I am not sure whether there is some data leakage issue.
So now I have both versions, with ddp_spawn and ddp, while ddp_spawn is slower... Essentially, my database is stored as one huge binary file from which I access data samples by index; they were encoded into torch_geometric.data.Data when I created the database, i.e. each entry in the database is a compressed torch_geometric.data.Data object.
Hi @rusty1s , is the reason for using ddp_spawn related to the InMemoryDataset classes and the shared-memory behaviour explained in https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#ddp-spawn-shared-memory ?
Yes, this is actually a note from me :)
If I create a torch.utils.data.Dataset (or slightly rewrite torch_geometric.data.Dataset) and access the data within __getitem__ by loading from an LMDB database, as done here: https://github.com/drorlab/atom3d/blob/master/atom3d/datasets/datasets.py#L34-L118 - can I use DDP from PyTorch Lightning? I tried it and the code runs, but I am not sure whether there is some data leakage issue.
As long as you do not hold all data in memory (as is the case for our InMemoryDataset), any DDP strategy works just fine. In your case, each replica will establish a connection to your LMDB database and will read single entries from it during __getitem__.
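A hedged sketch of such an LMDB-backed map-style dataset; the class name, integer-key scheme, and pickle serialization are assumptions for illustration, and the environment is opened lazily so that each DDP replica and each DataLoader worker gets its own handle:

```python
import pickle
import lmdb
import torch

class LMDBGraphDataset(torch.utils.data.Dataset):
    """Map-style dataset reading pickled torch_geometric.data.Data objects from LMDB."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.env = None  # opened lazily, so every DDP replica / worker gets its own handle
        env = lmdb.open(db_path, readonly=True, lock=False)
        self.length = env.stat()['entries']
        env.close()

    def _ensure_env(self):
        if self.env is None:
            self.env = lmdb.open(self.db_path, readonly=True, lock=False,
                                 readahead=False, meminit=False)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        self._ensure_env()
        with self.env.begin(write=False) as txn:
            raw = txn.get(str(idx).encode())
        return pickle.loads(raw)  # assumed: each entry is a pickled Data object
```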
Awesome. Thanks for the quick answer, Matthias!
Sorry for my limited understanding: can I use PyG in a Spark environment?
I haven't used it there personally, but I see no reason why this shouldn't work. As a general rule of thumb: if PyTorch works in Spark, then it is expected that PyG works there as well. If it doesn't, this is definitely a bug we are interested in fixing :)
Thanks @rusty1s. We will be glad to try it out. The question I have, though, is whether the distributed GPU optimization discussed here is mainly designed for the COO format. When I hook up something like Horovod, which is also a distributed deep learning framework, how is it going to work?
It depends on what you want to achieve:
A deep learning model implemented using PyTorch Geometric works well on a CPU. When it is tested on a single-node GPU, a CUDA out-of-memory issue is encountered, no matter whether inside the Databricks or SageMaker platform. Can anyone share some tutorials or examples about how to implement PyTorch Geometric using Databricks PySpark for distributed training? Thank you very much.
I don't have an example of using Databricks PySpark in combination with PyG. However, CUDA out-of-memory issues can usually be prevented quite easily, although it depends a bit on what your model and input data look like. How big are your mini-batches (number of nodes/number of edges), and how deep is your GNN model?
From now on, we recommend using our discussion forum (https://github.com/rusty1s/pytorch_geometric/discussions) for general questions.
❓ Questions & Help
What are the differences between using PyG on a single GPU and on multiple GPUs? Is DataListLoader designed for multi-GPU use only? Is this a data loading problem? Why is a single GPU utilized at only around 2%?
When we use PyG to reproduce some existing projects, what differences are there between code for a single GPU and code for multiple GPUs? Code that runs well on multiple GPUs does not work on a single GPU (even after removing the DataParallel calls). Are there any other code changes to make? Thanks.