microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

CPUAdam does not find CUDA #1620

Closed javier-alvarez closed 2 years ago

javier-alvarez commented 2 years ago

Discussed in https://github.com/microsoft/DeepSpeed/discussions/1619

Originally posted by **javier-alvarez** December 8, 2021

2021-12-08T15:12:02Z INFO Switching optimizer to DeepSpeedCPUAdam
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
[stderr]Traceback (most recent call last):
[stderr] File "InnerEyePrivate/ML/runner.py", line 57, in
[stderr] main()
[stderr] File "InnerEyePrivate/ML/runner.py", line 53, in main
[stderr] post_cross_validation_hook=runner.default_post_cross_validation_hook)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 442, in run
[stderr] return runner.run()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 219, in run
[stderr] self.run_in_situ(azure_run_info)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 398, in run_in_situ
[stderr] self.ml_runner.run()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/run_ml.py", line 327, in run
[stderr] num_nodes=self.azure_config.num_nodes)
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/model_training.py", line 263, in model_train
[stderr] trainer.fit(lightning_model, datamodule=data_module)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
[stderr] self._run(model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 717, in _run
[stderr] self.accelerator.setup(self, model) # note: this sets up self.lightning_module
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu.py", line 39, in setup
[stderr] return super().setup(trainer, model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in setup
[stderr] self.setup_optimizers(trainer)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 375, in setup_optimizers
[stderr] trainer=trainer, model=self.lightning_module
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 190, in init_optimizers
[stderr] return trainer.init_optimizers(model)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
[stderr] optim_conf = model.configure_optimizers()
[stderr] File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/SSL/lightning_modules/simclr_module.py", line 68, in configure_optimizers
[stderr] deepspeed_optim = DeepSpeedCPUAdam(params, lr=self.learning_rate, weight_decay=self.weight_decay)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
[stderr] self.ds_opt_adam = CPUAdamBuilder().load()
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
[stderr] return self.jit_load(verbose)
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 385, in jit_load
[stderr] assert_no_cuda_mismatch()
[stderr] File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 97, in assert_no_cuda_mismatch
[stderr] f"Installed CUDA version {sys_cuda_version} does not match the "
[stderr]Exception: Installed CUDA version 10.2 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.
[stderr]

https://github.com/microsoft/InnerEye-DeepLearning/pull/611/files

Any ideas why this does not find CUDA 11? It installs pytorch 1.8 and cuda 11 with conda.

Thanks!

jeffra commented 2 years ago

Hi @javier-alvarez,

Do you have interactive access to the machine you're running on here? If so, can you show me the output of ds_report?

[stderr]Exception: Installed CUDA version 10.2 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.

The above error is thrown as a safety precaution. What DeepSpeed is observing is that nvcc reports CUDA 10.2, but the installed version of torch was compiled with CUDA 11.1. DeepSpeed uses nvcc to compile some of our custom C++/CUDA ops at runtime; if the versions of nvcc and torch don't align, the ops will not run properly.

We pick up the nvcc path from torch.utils.cpp_extension.CUDA_HOME. If this path isn't the correct one for your environment, then there might be issues.
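To see the same mismatch outside of DeepSpeed, a rough diagnostic along these lines may help. It is only a sketch, not DeepSpeed's own check: the helper names `parse_nvcc_release`, `versions_match`, and `report` are hypothetical, and it assumes nvcc lives at `CUDA_HOME/bin/nvcc` and that comparing major.minor is close enough to what the assertion does.

```python
import re
import subprocess
from pathlib import Path


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_match(nvcc_version, torch_cuda_version):
    """Compare major.minor only; a missing version on either side counts as a mismatch."""
    if nvcc_version is None or torch_cuda_version is None:
        return False
    return nvcc_version.split(".")[:2] == torch_cuda_version.split(".")[:2]


def report():
    """Print the CUDA version torch was built with next to the one nvcc reports."""
    # Imported here so the helpers above work even where torch is not installed.
    import torch
    from torch.utils.cpp_extension import CUDA_HOME

    print("torch compiled with CUDA:", torch.version.cuda)
    print("CUDA_HOME:", CUDA_HOME)
    if CUDA_HOME is None:
        print("No CUDA toolkit located, nothing to compare.")
        return

    # Assumption: nvcc sits in the usual bin/ directory under CUDA_HOME.
    nvcc = Path(CUDA_HOME) / "bin" / "nvcc"
    try:
        result = subprocess.run(
            [str(nvcc), "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        print(f"nvcc not found at {nvcc}")
        return

    nvcc_version = parse_nvcc_release(result.stdout)
    print("nvcc reports CUDA:", nvcc_version)
    print("versions match:", versions_match(nvcc_version, torch.version.cuda))
```

Calling `report()` inside the failing environment should show which toolkit CUDA_HOME points at; in the case above one would expect nvcc to report 10.2 while torch reports 11.1.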

jeffra commented 2 years ago

I was able to reproduce the issue with your conda environment. After adding cudatoolkit-dev=11.1.1 to your conda dependencies, the issue appears to be resolved on my side.

javier-alvarez commented 2 years ago

This fixed the issue. I have changed the Azure ML image to:

"mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04"

It looks like having both CUDA 10.2 and 11.1 in the same environment does not work.