pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Single or multi GPU running #2529

Open · zz-cloud opened this issue 3 years ago

zz-cloud commented 3 years ago

From now on, we recommend using our discussion forum (https://github.com/rusty1s/pytorch_geometric/discussions) for general questions.

❓ Questions & Help

What are the differences when using PyG on a single GPU versus multiple GPUs? Is DataListLoader designed for multi-GPU use only? Could this be a data loading problem? A single GPU shows only around 2% utilization; why?

When we use PyG to reproduce existing projects, what differences are there between code written for a single GPU and for multiple GPUs? Code that runs well on multiple GPUs does not work on a single GPU, even after removing the DataParallel wrapper. Are there any other changes to make? Thanks!

rusty1s commented 3 years ago

We provide three options for multi-GPU training:

  1. DataParallel, which requires you to swap out DataLoader for DataListLoader (see the sketch after this list).
  2. DistributedDataParallel, which follows PyTorch's design principles for distributed training (this one is actually preferred over DataParallel, as it is faster and works in a single-machine/multi-GPU setting as well).
  3. PyTorch Lightning: probably the smoothest experience if you aim to minimize code changes when going from single-GPU to multi-GPU training.
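
For reference, here is a minimal sketch of option 1, loosely following PyG's data_parallel example; the dataset (MUTAG) and the tiny model are illustrative placeholders:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataListLoader  # torch_geometric.loader in PyG >= 2.0
from torch_geometric.nn import DataParallel, GCNConv, global_mean_pool

class Net(torch.nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.conv = GCNConv(in_channels, 64)
        self.lin = torch.nn.Linear(64, num_classes)

    def forward(self, data):
        # DataParallel hands each replica a single Batch object.
        x = F.relu(self.conv(data.x, data.edge_index))
        x = global_mean_pool(x, data.batch)
        return self.lin(x)

dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG')
loader = DataListLoader(dataset, batch_size=128, shuffle=True)  # yields Python lists
model = DataParallel(Net(dataset.num_features, dataset.num_classes)).to('cuda')

for data_list in loader:
    out = model(data_list)  # the list of Data objects is scattered across GPUs
    y = torch.cat([data.y for data in data_list]).to(out.device)
    loss = F.cross_entropy(out, y)
```
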
zz-cloud commented 3 years ago

> We provide three options for multi-GPU training:
>
>   1. DataParallel, which requires you to swap out DataLoader for DataListLoader.
>   2. DistributedDataParallel, which follows PyTorch's design principles for distributed training (this one is actually preferred over DataParallel, as it is faster and works in a single-machine/multi-GPU setting as well).
>   3. PyTorch Lightning: probably the smoothest experience if you aim to minimize code changes when going from single-GPU to multi-GPU training.

Thank you for your response! To go from multi-GPU to single-GPU training, is deleting the DataParallel/DistributedDataParallel wrapper from the code all that is necessary? Do any other changes need to be made? Thanks!

rusty1s commented 3 years ago

To go from multi GPU to single GPU, you can just execute your multi GPU training script using a single GPU.

zz-cloud commented 3 years ago

When using DataListLoader, I want to set the DataLoader parameter pin_memory=True to speed up loading data onto the GPU. How can I set pin_memory=True? Thank you!

zz-cloud commented 3 years ago

```python
import torch.utils.data


def identity_collate(data_list):
    # Keep each mini-batch as a plain Python list of Data objects instead of
    # collating it into a single Batch (DataParallel scatters the list itself).
    return data_list


class DataListLoader(torch.utils.data.DataLoader):
    r"""Data loader which merges data objects from a
    :class:`torch_geometric.data.Dataset` to a Python list.

    .. note::

        This data loader should be used for multi-GPU support via
        :class:`torch_geometric.nn.DataParallel`.

    Args:
        dataset (Dataset): The dataset from which to load the data.
        batch_size (int, optional): How many samples per batch to load.
            (default: :obj:`1`)
        shuffle (bool, optional): If set to :obj:`True`, the data will be
            reshuffled at every epoch. (default: :obj:`False`)
    """
    def __init__(self, dataset, batch_size=1, shuffle=False, **kwargs):
        super(DataListLoader, self).__init__(
            dataset, batch_size, shuffle,
            collate_fn=identity_collate, **kwargs)
```

pin_memory is a parameter of torch.utils.data.DataLoader. How can I set it to True here?

zz-cloud commented 3 years ago

How does DataListLoader pass the pin_memory parameter through to torch.utils.data.DataLoader?

rusty1s commented 3 years ago

You can pass the pin_memory=True option to DataListLoader, which should work just fine for PyG>=1.7.0.
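
For reference, a minimal usage sketch (it assumes `dataset` has already been constructed as some PyG dataset):

```python
from torch_geometric.data import DataListLoader  # torch_geometric.loader in PyG >= 2.0

# pin_memory is forwarded to the underlying torch.utils.data.DataLoader:
loader = DataListLoader(dataset, batch_size=32, shuffle=True, pin_memory=True)
```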

zz-cloud commented 3 years ago

Do I need to use a special function, or does the source code already support this? Thank you.


rusty1s commented 3 years ago

I'm not sure what you mean. Can you clarify?

zz-cloud commented 3 years ago

Does the DataListLoader of PyG 1.7.0 have a special interface to pass the pin_memory parameter to torch.utils.data.DataLoader? Or does the source code of the PyG 1.7.0 DataListLoader already contain a pin_memory parameter? Thank you.


rusty1s commented 3 years ago

This argument gets passed to the DataLoader via **kwargs, which means you can pass any argument that the PyTorch DataLoader understands. In addition, in PyG 1.7.0 we added pin_memory support to the data object, so that PyTorch can automatically pin memory when pin_memory=True is set.
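
A small illustration of what this enables (the toy Data object is a placeholder):

```python
import torch
from torch_geometric.data import Data

data = Data(x=torch.randn(4, 16),
            edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]]))

# Since PyG 1.7.0, Data implements pin_memory(), which is exactly what the
# DataLoader calls on each sample when pin_memory=True:
data = data.pin_memory()                   # tensors now live in page-locked memory
data = data.to('cuda', non_blocking=True)  # pinned memory enables an async copy
```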

zz-cloud commented 3 years ago

thank you

---Original--- From: "Matthias @.> Date: Tue, May 18, 2021 14:39 PM To: @.>; Cc: @.**@.>; Subject: Re: [rusty1s/pytorch_geometric] Single or multi GPU running (#2529)

This argument gets passed to the DataLoader via **kwargs, which means you can pass any argument to the DataLoader that the PyTorch DataLoader can understand. However, in PyG 1.7.0, we added pin_memory support to the data object, so that PyTorch can automatically pin memory in case of pin_memory=True.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ItamarChinn commented 2 years ago

Is there any plan to introduce the ddp strategy for Lightning instead of ddp_spawn? ddp_spawn is strongly discouraged in the Lightning docs.

rusty1s commented 2 years ago

ddp cannot share data across processes. We will need to wait for https://github.com/pytorch/pytorch/issues/64932 in order to support the ddp strategy instead of ddp_spawn. I don't think the usage of ddp_spawn is bad, though, as PyTorch itself uses it for all its DDP examples.
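
For context, switching between the two strategies in Lightning is a one-argument change on the Trainer. A sketch, assuming the PyTorch Lightning >= 1.5 Trainer API (older versions passed accelerator="ddp_spawn" instead):

```python
import pytorch_lightning as pl

# Spawn-based DDP: worker processes are spawned at fit() time, so tensors
# placed in shared memory (e.g. via share_memory_()) are not copied per GPU.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn")

# Script-based DDP: one long-lived process per GPU; faster, but each process
# loads its own copy of an in-memory dataset unless it is shared another way.
# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
```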

tuanle618 commented 2 years ago

Hi @rusty1s, is the reason for using ddp_spawn the InMemoryDataset classes, related to the shared-memory note explained in https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#ddp-spawn-shared-memory ?

If I create a torch.utils.data.Dataset, or rewrite torch_geometric.data.Dataset slightly, and access the data within __getitem__ by loading from an LMDB database, as done here: https://github.com/drorlab/atom3d/blob/master/atom3d/datasets/datasets.py#L34-L118 - can I use DDP from PyTorch Lightning? I tried it, and the code runs, but I am not sure if there is some data leakage issue.

So now I have both versions, ddp_spawn and ddp, and ddp_spawn is slower. Essentially, my database is stored as one huge binary file from which I access data samples by index; each entry is a compressed torch_geometric.data.Data object that was encoded when I created the database.

rusty1s commented 2 years ago

> Hi @rusty1s, is the reason for using ddp_spawn the InMemoryDataset classes, related to the shared-memory note explained in https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#ddp-spawn-shared-memory ?

Yes, this is actually a note from me :)

> If I create a torch.utils.data.Dataset, or rewrite torch_geometric.data.Dataset slightly, and access the data within __getitem__ by loading from an LMDB database, as done here: https://github.com/drorlab/atom3d/blob/master/atom3d/datasets/datasets.py#L34-L118 - can I use DDP from PyTorch Lightning? I tried it, and the code runs, but I am not sure if there is some data leakage issue.

As long as you do not hold all data in memory (as is the case for our InMemoryDataset), any DDP strategy works just fine. In your case, each replica will establish its own connection to your LMDB database and will read single entries from it during __getitem__.
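
A hypothetical sketch of such a dataset (the key layout, a b'__len__' entry plus one pickled sample per stringified index, is an assumption, not atom3d's actual schema):

```python
import pickle

import lmdb
import torch


class LMDBDataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.path = path
        self.env = None  # opened lazily, so each DDP replica gets its own handle

    def _connect(self):
        self.env = lmdb.open(self.path, readonly=True, lock=False)
        with self.env.begin() as txn:
            self.num_samples = pickle.loads(txn.get(b'__len__'))

    def __len__(self):
        if self.env is None:
            self._connect()
        return self.num_samples

    def __getitem__(self, idx):
        if self.env is None:  # connect inside the worker, safe after fork/spawn
            self._connect()
        with self.env.begin() as txn:
            # Each entry is assumed to be a pickled torch_geometric.data.Data.
            return pickle.loads(txn.get(str(idx).encode()))
```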

tuanle618 commented 2 years ago

Awesome. Thanks for the quick answer, Matthias!

rwforest commented 2 years ago

Sorry for my limited understanding, but can I use PyG in a Spark environment?

rusty1s commented 2 years ago

I haven't used it there personally, but I see no reason why this shouldn't work. As a general rule of thumb: if PyTorch works in Spark, then PyG is expected to work there as well. If it doesn't, this is definitely a bug we are interested in fixing :)

rwforest commented 2 years ago

Thanks @rusty1s. We will be glad to try it out. The question I have, though, is: is the distributed GPU optimization discussed here mainly designed for the COO format? When I hook up something like Horovod, which is also a distributed deep learning platform, how is it going to work?

rusty1s commented 2 years ago

It depends on what you want to achieve:

XinQi7788 commented 1 year ago

A deep learning model implemented using PyTorch Geometric works well on a CPU.

When it is tested on a single-node GPU, a CUDA out-of-memory issue is encountered, whether inside Databricks or SageMaker.

Can anyone share tutorials or examples of how to implement PyTorch Geometric with Databricks PySpark for distributed training?

Thank you very much.

rusty1s commented 1 year ago

I don't have an example of utilizing Databricks PySpark in combination with PyG. However, CUDA out-of-memory issues can usually be prevented quite easily, though it depends a bit on what your model and input data look like. How big are your mini-batches (number of nodes/number of edges), and how deep is your GNN model?
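
A common first mitigation, shown here as a sketch with PyG 2.x's NeighborLoader on a toy graph (the graph and all sizes are placeholders), is to bound the work done per step via neighbor sampling:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Toy graph standing in for the real data.
data = Data(x=torch.randn(10_000, 64),
            edge_index=torch.randint(0, 10_000, (2, 100_000)))

# Each mini-batch now holds at most 1024 seed nodes plus their sampled
# two-hop neighborhood (<= 15 and 10 neighbors per hop), which bounds the
# peak GPU memory per training step.
loader = NeighborLoader(data, num_neighbors=[15, 10], batch_size=1024,
                        shuffle=True)
```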