unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
7.98k stars 869 forks

Compatibility of Darts with CUDA 10.1 and Differences in Usage from the Current Version #1931

Closed 1404971870 closed 7 months ago

1404971870 commented 1 year ago

Hello,

I hope you are doing well. I have encountered an issue while trying to use the latest version of Darts on my laboratory server, which has CUDA 10.1. Whenever I attempt to train a model on multiple GPUs, I get a `RuntimeError: No precision set` error. Because of this limitation, I would like to find out which version of Darts is compatible with CUDA 10.1 and how its usage differs from the current version.

I would be grateful if you could provide me with the following information:

1. Which version of Darts is known to be compatible with CUDA 10.1?
2. What are the specific differences in usage between that version and the latest one that might be causing the `RuntimeError: No precision set` issue?
3. Are there any steps or considerations I need to take into account when using the CUDA 10.1-compatible version on multiple GPUs?

Your insights and assistance on this matter would be immensely helpful to me. Thank you very much for your time and support!

madtoinou commented 1 year ago

Hi @1404971870,

Darts relies on PyTorch Lightning for all the torch-based models, which should automatically take care of the abstraction related to CUDA. Which version of Darts are you using? And which version of PyTorch Lightning is installed in your environment? If it's not 2.0 or above, I would recommend upgrading it.

Moreover, the precision of a model in Darts is based on the dtype of the `TimeSeries` used as target when calling `fit()`. It's possible to override it by passing `pl_trainer_kwargs={"precision": "64"}` (or `"32"`) to the model constructor. Enforcing a specific precision might solve your problem?
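The two routes above can be sketched as follows (a minimal illustration, not a full training script; the exact precision strings accepted depend on your Lightning version, e.g. `"32"` in 1.x and `"32-true"` in 2.x):

```python
import numpy as np

# Route 1 (sketch): darts infers model precision from the dtype of the
# target series, so cast the underlying values to float32 before building
# the TimeSeries (TimeSeries construction itself omitted here).
values = np.arange(100, dtype=np.float64)
values32 = values.astype(np.float32)

# Route 2 (sketch): force the precision explicitly via the trainer kwargs
# that darts forwards to the PyTorch Lightning Trainer.
pl_trainer_kwargs = {"precision": "32"}

print(values32.dtype)
```

Either route should leave the model and its input data on the same precision, which is what the `No precision set` check is guarding.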

Let me know if one of these two approaches solves your problem; I won't be able to answer your questions about CUDA 10.1, as I could not find relevant information in the PyTorch Lightning documentation.

1404971870 commented 1 year ago

I have built a Docker environment with torch==1.12.1+cu102, torchvision==0.13.1+cu102, torchaudio==0.12.1, and pytorch-lightning 2.0.6. However, I noticed that some parameters have changed when setting `pl_trainer_kwargs`: for example, `devices` is now used instead of the previous `gpus`. Is there a tutorial available for multi-GPU training with the new version of pytorch-lightning?
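For reference, a sketch of the Lightning 2.x-style trainer arguments that replace the old `gpus=...` flag (the specific device indices and strategy here are illustrative assumptions, not values from this thread):

```python
# PL 2.x replaced gpus=... with the accelerator/devices pair; darts
# forwards this dict to the Lightning Trainer unchanged.
pl_trainer_kwargs = {
    "accelerator": "gpu",  # run on CUDA devices
    "devices": [0, 1],     # which GPUs to use (-1 would mean "all")
    "strategy": "ddp",     # distributed data parallel across the GPUs
}

# This dict would then be passed to a darts torch model constructor,
# e.g. SomeModel(..., pl_trainer_kwargs=pl_trainer_kwargs)  # illustrative
print(sorted(pl_trainer_kwargs))
```

With `strategy="ddp"`, Lightning spawns one process per GPU, which is where NCCL enters the picture as the inter-process communication backend.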

Regardless of whether I am using the old or new version of pytorch-lightning, I encounter `RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3`. The error message suggests it may be due to a system call failure, a device error, or the unexpected exit of a remote peer, and advises checking NCCL warnings to determine the exact cause and whether a peer closed the connection.
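A common first diagnostic step for opaque NCCL errors (a general suggestion, not something established in this thread) is to enable NCCL's own logging before the process group is initialized, so the "unhandled system error" is preceded by a line naming the actual failure:

```python
import os

# Enable verbose NCCL logging; these must be set before torch creates the
# process group (ideally before importing torch at all).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit noise to init/network

print(os.environ["NCCL_DEBUG"])
```

With `NCCL_DEBUG=INFO`, the log usually contains a `WARN` line pointing at the underlying cause, such as a shared-memory limit in the Docker container (often fixed by enlarging `/dev/shm`) or an unusable network interface.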

Do you have any solutions to resolve this issue?

madtoinou commented 1 year ago

I could not find any tutorial for the latest version of pytorch-lightning, but based on the documentation it seems rather straightforward.

I don't have the hardware to try to reproduce your issue, but I know that @solalatus has some experience with using Darts on multiple GPUs. This is the first time I have seen such an error reported here.

solalatus commented 1 year ago

I have to admit, I have never run into this error myself. It looks like a Lightning-level or even Torch-level problem to me.