
[POLL][RFC] DataParallel Deprecation #65936

Open cbalioglu opened 2 years ago

cbalioglu commented 2 years ago

Hey everyone,

We wanted to let you know that we are considering deprecating the DataParallel module (torch.nn.DataParallel, a.k.a. DP) with the upcoming v1.11 release of PyTorch. Our plan is to keep DataParallel in maintenance mode for the 12 months following the v1.11 release and afterwards remove it completely from our code base. Whether or not we follow this plan, we highly encourage everyone relying on DataParallel to migrate to DistributedDataParallel (torch.nn.parallel.DistributedDataParallel, a.k.a. DDP), as we consider it the future of PyTorch Distributed. Your feedback is very important to us, so please let us know your thoughts.

--

Why should I use DDP over DP? DDP is a faster and more scalable alternative to DP. It is in active development, has a clear roadmap, and represents the future of PyTorch Distributed. DP was our first attempt at data-parallel training, but over time, due to its inefficiency and its lack of key features such as multi-node training, it fell into disuse. Today the large majority of our users have already migrated to DDP. We ask our remaining users to migrate in the near future so that we can dedicate all our resources to DDP.

I use DP on a single multi-GPU machine. Why do I have to use a “distributed” API? DDP can be used on a single machine as well. In fact, it is faster than DP on a single machine with multiple GPUs since it avoids Python’s GIL.

How is DDP faster than DP? DP uses multi-threaded parallelism, while DDP uses multi-process parallelism. Due to Python’s infamous GIL contention across threads, the collective operations of DP introduce a lot of overhead and are universally slower than their DDP counterparts. Moreover, DP repeatedly copies model parameters between threads, which introduces yet more overhead.

But DP is a lot easier to use; DDP requires me to write additional code to set it up. Is there a way to avoid this? We are actively working on filling any usability gaps between the two APIs and aim to have parity in the coming months, before the removal of DP. Also note that DDP comes with a launcher script (see here) that handles most of the boilerplate code. In the meantime, if you have any suggestions, feel free to reach out to us.
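For illustration, here is a minimal single-node sketch of what a migrated script can look like; the file name and model are placeholders, and launch is assumed to go through torchrun (or torch.distributed.launch on older releases):

```python
# minimal_ddp.py -- illustrative sketch only; run with: torchrun --nproc_per_node=<num_gpus> minimal_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # reads RANK/WORLD_SIZE set by the launcher
    local_rank = int(os.environ["LOCAL_RANK"])          # also set by the launcher
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])          # roughly where nn.DataParallel(model) used to go

    # ... regular training loop; gradients are synchronized across processes automatically ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```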

--

If you have any specific questions, please let us know below. Going forward we will actively monitor this post and try to address any concerns, suggestions, or feedback you have.

Thank you, PyTorch Distributed Team

cc @mrshenli @pritamdamania @rohan-varma @vitaly-fedyunin @ngimel @bowangbj

ananthsub commented 2 years ago

Also note that DDP comes with a launcher script (see here) that handles most of the boilerplate code.

@cbalioglu another popular use case for DataParallel is within Jupyter notebooks (based on what we've seen for PyTorch Lightning users). Is there a notebook launcher that could be supported as well? https://discuss.pytorch.org/t/distributeddataparallel-on-terminal-vs-jupyter-notebook/101404

cc @awaelchli

h6197627 commented 2 years ago

Another question: what is the real status of #35191? Is it no longer relevant, and is the warning obsolete?

vadimkantorov commented 2 years ago

One other use case for DP versus DDP is interactive debugging by setting breakpoint() in model code.

tstandley commented 2 years ago

I feel like it's more difficult to make sure DDP code is working correctly, especially if there isn't DP to check loss curves against.

For example, when you have multiple processes feeding training data to a neural network, you really want to make sure they are correctly partitioning the data for each epoch. With DP this is guaranteed. With DDP, you have to trust that everything is working correctly with your DataLoader. I know PyTorch includes the DistributedSampler module, but I always feel like I have to take it on faith that I'm using it right; there really is no way to be sure. And it's not like there are no gotchas: you also have to remember to call sampler.set_epoch(...), otherwise your code will silently do the wrong thing. I realized about a year ago that all of my DDP models had been trained without setting the sampler epoch. It doesn't make a huge difference, but it's tangible when comparing against the performance metrics of other papers.
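(For reference, and not the commenter's code: a minimal sketch of the DistributedSampler pattern being described, assuming torch.distributed.init_process_group has already been called and using a placeholder dataset.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# placeholder dataset
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

sampler = DistributedSampler(dataset, shuffle=True)   # partitions indices across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)   # without this, every epoch silently reuses the same shuffle
    for inputs, targets in loader:
        pass                   # forward/backward as usual
```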

Are you sure you can solve the zombie-process problem of DDP? When I'm prototyping, it's a major pain to have to deal with that every time I press Ctrl-C.

I also feel like there are issues with NCCL hanging when you want to synchronize. I've had such headaches trying to get this to work, and sometimes the hangs happen later, after much training time has already been invested. Sometimes it just stops working, too: you change one unrelated line and your code starts hanging. Often I give up and just accept that the training statistics come from only one process. In fact, I claim that almost every GitHub repo I find that uses DDP does validation either on only one GPU, or incorrectly: either it shows the results from only a fraction of the validation data, or it has every GPU run validation on the entire set, which is slower and wastes electricity and RAM. Sometimes you even see people pause the other processes because they realize this is happening and can't fix it the right way.
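(A rough sketch, not from the thread, of one way to report metrics over the full validation set by reducing per-rank counts; it assumes an initialized process group, a CUDA model, and a val_loader backed by a DistributedSampler.)

```python
import torch
import torch.distributed as dist

correct = torch.zeros(1, device="cuda")
total = torch.zeros(1, device="cuda")

with torch.no_grad():
    for inputs, targets in val_loader:                  # each rank sees only its shard
        preds = model(inputs.cuda()).argmax(dim=1)
        correct += (preds == targets.cuda()).sum()
        total += targets.numel()

dist.all_reduce(correct)                                 # sum the partial counts from every rank
dist.all_reduce(total)
if dist.get_rank() == 0:
    print(f"val accuracy: {(correct / total).item():.4f}")
```

One caveat: DistributedSampler pads the dataset so every rank draws the same number of samples, so exact per-sample metrics need drop_last or de-duplication of the padded examples.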

It's not just a few lines of extra code to make it work right. It's at least a dozen and they are finicky lines. I don't have a lot of confidence that the API parity will actually be parity.

I still don't understand why DP can't be written to avoid unnecessary parameter copies. The simplicity of DP is worth some love.

Finally, DDP uses an extra CPU core per GPU you have. This isn't a big deal in a lot of settings: professional systems have many, slower cores. But a lot of us use consumer CPUs with fewer, faster cores and as many GPUs as we can pack in. These systems can actually run slower with DDP because data loading can be starved for resources, while the main process is mostly waiting on GPU operations and so isn't the bottleneck.

For what it's worth, I've switched to using DDP most of the time. But DDP isn't always better. It's a big pain to use, it's harder to be sure things are correct, and you don't always get a tangible benefit.

ngimel commented 2 years ago

Thank you for your feedback @tstandley. I see a few actionable items here:

1. Improve the documentation for DistributedSampler, and possibly enable warnings when set_epoch or the random seed appears to be set incorrectly. cc @ejguan
2. Gracefully handle zombie processes from Ctrl-C-interrupted DDP. cc @mrshenli, @cbalioglu
3. Guidelines and utilities for correctly doing multi-GPU validation.
4. I'm not sure the NCCL hangs are actually actionable, but cc @mrshenli just in case.

As for using an extra CPU core per GPU: DP spawns extra threads, so in terms of CPU usage it should not amount to a lot of difference.

vadimkantorov commented 2 years ago

DistributedSampler should support arbitrary underlying samplers; I think right now everyone copies that piece of wrapper code from Catalyst :(
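(A hedged, illustrative sketch of the kind of wrapper being referred to; the names are made up, it assumes an initialized process group and a fixed-length base sampler, and it simply shards whatever indices the wrapped sampler produces by reusing DistributedSampler over them.)

```python
from torch.utils.data import Dataset, Sampler
from torch.utils.data.distributed import DistributedSampler

class _SamplerIndices(Dataset):
    """Exposes the indices drawn from an arbitrary sampler as a dataset."""
    def __init__(self, sampler: Sampler):
        self.indices = list(sampler)
    def __len__(self):
        return len(self.indices)
    def __getitem__(self, i):
        return self.indices[i]

class DistributedSamplerWrapper(DistributedSampler):
    """Shards an arbitrary sampler across ranks (hypothetical helper, not a core API)."""
    def __init__(self, sampler: Sampler, **kwargs):
        self.base_sampler = sampler
        super().__init__(_SamplerIndices(sampler), **kwargs)

    def __iter__(self):
        self.dataset = _SamplerIndices(self.base_sampler)        # re-draw the base sampler each epoch
        return iter([self.dataset[i] for i in super().__iter__()])
```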

ngimel commented 2 years ago

@vadimkantorov can you please file an issue requesting additional DistributedSampler functionality?

vadimkantorov commented 2 years ago

It's at https://github.com/pytorch/pytorch/issues/23430 filed by @fmassa in 2019 :)

vadimkantorov commented 2 years ago

One other issue may be CPU thread over-subscription, especially in data-loader workers. One case was in https://github.com/pytorch/pytorch/issues/42959 even without data-loader workers. With data-loader workers this may become even more acute, requiring a recommendation to always set, or automatically adjust, the number of CPU threads in every replica created by the launcher (especially given that every replica has its own data loader, each with multiple workers).
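(A hedged sketch of one common mitigation, with placeholder thread counts and dataset: cap intra-op threads in each replica and in every DataLoader worker so that replicas × workers × threads does not oversubscribe the CPU.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.set_num_threads(2)            # per-replica intra-op threads in the training process

def worker_init_fn(worker_id):
    torch.set_num_threads(1)        # keep each data-loading worker single-threaded

dataset = TensorDataset(torch.randn(1000, 10))   # placeholder dataset
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    worker_init_fn=worker_init_fn)
```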

vadimkantorov commented 2 years ago

For distributed evaluators, it may be convenient to introduce synchronizable/concatenatable/mergeable data structures such as dict or list (or helper methods for them), so that users don't have to figure out the correct way of doing a gather for a particular data structure.
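(A rough sketch of what such a helper could look like today using all_gather_object, which accepts arbitrary picklable Python objects; it assumes an initialized process group, and the per-rank results below are placeholders.)

```python
import torch.distributed as dist

local_preds = [("img_0.jpg", 0.91), ("img_1.jpg", 0.13)]     # placeholder per-rank results

gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, local_preds)                 # every rank receives every list

if dist.get_rank() == 0:
    all_preds = [p for part in gathered for p in part]        # concatenate in rank order
```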

vadimkantorov commented 2 years ago

@ngimel Here's a recent example of CPU thread oversubscription: https://github.com/pytorch/pytorch/issues/66249#issuecomment-939207348

vadimkantorov commented 2 years ago

@ngimel My related issue on guidance for implementing dataset evaluators: https://github.com/pytorch/pytorch/issues/64905

This is especially an issue because much of the (often legacy) code for existing datasets was written without configurable multi-process parallelization in mind.

vadimkantorov commented 2 years ago

@ngimel Problems with the dataset object being slowly copied into every data loader worker are also probably more acute in DDP mode: https://github.com/pytorch/pytorch/issues/13246

vadimkantorov commented 2 years ago

E.g. the popular Deformable-DETR distributed scripts leave dangling replicas and defunct processes if an exception is raised at the beginning of training.

With DDP it's then super hard to kill all remaining replicas without touching other running Python processes.

https://github.com/fundamentalvision/Deformable-DETR/blob/main/tools/run_dist_launch.sh

Another useful primitive/util in core may be a function to sync/average/scatter model params across all replicas, to ensure no divergence due to non-determinism.
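(A hedged sketch of such a utility: broadcast parameters and buffers from rank 0 so all replicas agree, assuming an initialized process group; averaging via all_reduce would work similarly.)

```python
import torch.distributed as dist

def sync_model_from_rank0(model):
    """Overwrite every replica's parameters and buffers with rank 0's copy."""
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=0)
```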

zhmlcg commented 2 years ago

I am just starting to learn PyTorch by reading the book "Deep Learning with PyTorch: Build, train, and tune neural networks using Python tools". Below is a paragraph on page 12 of the book:

It’s increasingly common to use more elaborate hardware like multiple GPUs or multiple machines that contribute their resources to training a large model, as seen in the bottom center of figure 1.2. In those cases, torch.nn.parallel.DistributedDataParallel and the torch.distributed submodule can be employed to use the additional hardware.

torch.nn.parallel.DistributedDataParallel is the DDP, right? Since it is mentioned in the text, I believe it is the technique I should learn and apply. So don't hesitate, go ahead.

The only things I'm expecting are:

  1. The new DDP is easy to install, with no strange restrictions, complaints, or even segmentation faults.
  2. Sadly, the aforementioned book does not cover DDP as far as I know, so I hope the DDP documentation is easy to read (like a well-written textbook for beginners; right now I just have no idea what the intro is talking about), and that the sample code is easy to obtain (written right on the doc or intro page) and easy to run (no more than 10 lines, runnable without errors simply by copy-pasting into the Python CLI without knowing what they do).

la-cruche commented 2 years ago

Please consider keeping an API for single-process, single-node, multi-device training.

TL;DR: the torch.nn.DataParallel API fits a much-demanded use case and is really easy to use. Instead of deprecating it because of poor performance caused by its internals, I would prefer that it stay and that its internals be improved.

Single-node multi-device training is the first thing people do when distributing code because (1) latencies and throughput are often better within a node than across nodes, (2) it is easy to write and debug, and (3) single-node compute power is constantly increasing, reducing the need to train over multiple machines (a single AWS p4d.24xlarge can handle what people used to do on multiple P2s years ago). Not everybody needs to train GPTs, and my experience is that more and more DL workloads can train within a big multi-GPU node.

It seems brutal to deprecate the torch.nn.DataParallel front end because of its back-end performance, especially given how dramatically more complex the proposed DDP replacement is, at least in its current shape. Deprecating torch.nn.DataParallel in favor of DDP-for-everything is actually a step backward in terms of usability. Who dreams about configuring and handling process groups, MPI, launch.py/torchrun utilities, and setting ranks manually in each process? Look at the success of Hugging Face or PySpark: developers want compact, single-script APIs that abstract away the complexity. MXNet and Keras both have (or had) single-node, single-script data parallelism for several years (Keras multi_gpu_model was sadly deprecated by TensorFlow), though I don't know how their performance would rank against DDP and Horovod. But I don't recall the Gluon documentation warning about GPU memory or GIL problems, or even discouraging its own use the way the DataParallel doc does.

I think your goal should be to produce a distributed-SGD experience similar to what PySpark achieved for distributed tabular data processing. Think about how easy it is to write a single-file PySpark data-processing job and run it over multiple machines: things written in PySpark natively parallelize over the available hardware (and though in practice Spark fine-tuning is far from smooth and requires both knowledge and trial and error, it is at least optional). Why aren't things this simple in the PyTorch world yet? We're clearly lagging behind the big-data world in terms of abstracting complexity. In my opinion, the current torch.nn.DataParallel API is perfect and aligned with that need for simplicity (one import, one model wrap, and you're good); its problem is just its back end, which needs some optimization. Ideally, customers would have the API experience of torch.nn.DataParallel with the performance of DDP.

Edited on 10/29: added a TL;DR and tempered my MXNet comment, since apparently single-process multi-device MXNet also suffers from the GIL, though I didn't see that highlighted or warned about in their docs.

aluo-x commented 2 years ago

There are still two minor complications for DDP that I encounter:

  1. There is no official support for higher-order gradients, which is important for GAN training with gradient penalty. See #63929
  2. Sometimes DDP processes are not cleanly shut down with Ctrl+C, which requires pkill -f "filename.py" (also noted in two comments above)

I'm still in favor of maintaining DataParallel, for backwards compatibility with the huge amount of existing PyTorch code if nothing else.

vadimkantorov commented 2 years ago

It would be nice to have random port selection by default (https://github.com/pytorch/pytorch/issues/73320), without having to set nproc_per_node beforehand and without passing rendezvous-related arguments (as proposed in https://github.com/pytorch/pytorch/pull/73598).
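(A hedged workaround sketch, not the linked proposal: have the launching process pick a free port and export it as MASTER_PORT before spawning replicas, so concurrent jobs get distinct ports; all replicas must of course agree on the chosen value, so do this once in the launcher rather than per replica.)

```python
import os
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))                 # port 0 lets the OS pick an unused port
        return s.getsockname()[1]

os.environ.setdefault("MASTER_PORT", str(find_free_port()))
```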

micahcc commented 9 months ago

FWIW, today on version 2.1 I was pained by debugging with DDP (multiprocessing), turned my trainer back to DataParallel, and got a 25% training-time speedup. For very straightforward neural networks I don't know that DDP makes sense. I can't share code because it's internal, but if you take the torchvision detector trainers and test them, I suspect you would find something similar.

The docs for DDP say:

In fact it is faster than DP on a single machine with multiple GPUs since it avoids Python's GIL.

I would like to see some numbers to back this up. Neural nets range from digit recognition, which can use a batch size of 10,000, all the way to LLMs that don't even fit on a single GPU; the bottlenecks vary greatly. Any claim like this, and IMO the deletion itself, should be accompanied by solid proof: a bar plot of A/B comparisons across the torchvision, torchaudio, etc. trainers.

ex3ndr commented 6 months ago

In my learning experience with two RTX 4090s without P2P support, DP is much better and easier to use than DDP, which is really beneficial only when you have P2P. Also, as a casual trainer, I find DP much easier to set up than DDP.