microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Any documentation to use DeepSpeed to schedule training on an Azure compute cluster? #759

Open manchandasahil opened 3 years ago

manchandasahil commented 3 years ago

I am trying to schedule a BERT-large MLM training run on an Azure compute cluster using DeepSpeed. It seems DeepSpeed needs SSH access to the VMs, but our data is huge, so we need some guidance on how to launch DeepSpeed on an Azure compute cluster.

tjruwase commented 3 years ago

@manchandasahil, have you seen this tutorial: https://www.deepspeed.ai/tutorials/azure/

awan-10 commented 3 years ago

Hi @manchandasahil, if Azure ML works for you, we have a short video about DeepSpeed on AML here: https://www.youtube.com/watch?v=yBVXR8G8Bg8&feature=youtu.be

We are also part of:

manchandasahil commented 3 years ago

@awan-10 Thank you so much for your tutorial. I followed it and was able to run DeepSpeed on a single node, but as I increase the number of nodes I get the following:

[2021-02-18T11:48:40.368929] Writing error with error_code UserError to hosttool error file located at /mnt/batch/tasks/workitems/1e9edaa3-985f-4e6b-895d-025a0c3283eb/job-1/bert-pretraining-ds-_ccbd9647-0f77-4d0b-8e47-6db683e1d7f3/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 1733
Traceback (most recent call last):
  File "run_mlm.py", line 661, in <module>
    main()
  File "run_mlm.py", line 204, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.8/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 60, in __init__
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 479, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and self.fp16:
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 601, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1336, in __get__
    cached = self.fget(obj)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/training_args.py", line 563, in _setup_devices
    deepspeed.init_distributed()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 480, in init_process_group
    barrier()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2190, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, unhandled system error, NCCL version 2.8.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

[2021-02-18T11:48:40.444748] Finished context manager injector with Exception.

Thank you so much in advance for the help.
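For anyone hitting the same ncclSystemError, a common first diagnostic step is to turn on NCCL's own logging and pin it to the network interface that actually connects the nodes. These are standard NCCL environment variables, not something from this thread; the interface name eth0 below is an assumption and must be replaced with the NIC your nodes use to reach each other:

```shell
# Standard NCCL debugging knobs (sketch; eth0 is an example interface name):
export NCCL_DEBUG=INFO           # make NCCL log its init and transport choices
export NCCL_SOCKET_IFNAME=eth0   # pin NCCL sockets to a specific interface
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
```

With NCCL_DEBUG=INFO set on every node, the log usually shows which socket or interface the failing system call was made on.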

awan-10 commented 3 years ago

I see. Are you using the DeepSpeed curated environment, or your own custom-built environment? The error message you have is coming from NCCL 2.8.3, which is not what we have in the curated environment.

Can you please run ds_report on your machine and share the output? I want to see which versions of PyTorch, NCCL, and DeepSpeed are being used here.

cc @jeffra
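If running ds_report inside the cluster job is awkward, the version facts it prints can also be collected with a small guarded-import snippet. This is a sketch (the function name is made up for illustration, not part of DeepSpeed), and it degrades gracefully on machines where a package is missing:

```python
import importlib

def collect_versions(modules=("torch", "deepspeed")):
    """Return {module: version-or-status} for each requested module."""
    info = {}
    for name in modules:
        try:
            mod = importlib.import_module(name)
            # Most packages expose __version__; fall back to "unknown".
            info[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            info[name] = "not installed"
    return info

print(collect_versions())
```

Dropping a call like this into the training script's entry point logs the versions from the exact environment the job runs in, which is what matters when curated and custom images diverge.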

manchandasahil commented 3 years ago

Hi, thank you for replying. I have copied the ds_report output from my docker image:

Please ignore the CUDA errors; I ran it on my local system, which does not have GPUs.


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/bash: line 0: type: llvm-config: not found
/bin/bash: line 0: type: llvm-config-9: not found
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0a0+1606899
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1

manchandasahil commented 3 years ago

Actually, do you have a recent image with Apex, Open MPI, and everything else needed to run on Azure, or a base image that we can install on top of? Is the Docker repo deepspeed/deepspeed correct?
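One possible starting point, sketched under the assumption that an NVIDIA NGC PyTorch base image is acceptable (the image tag below is illustrative, and this is not an official DeepSpeed image recommendation):

```dockerfile
# Illustrative sketch only: base image tag is an assumption, not an
# official DeepSpeed recommendation. NGC PyTorch images typically ship
# with CUDA, NCCL, Apex, and Open MPI preinstalled.
FROM nvcr.io/nvidia/pytorch:21.02-py3

# DeepSpeed installs from PyPI; its ops are JIT-compiled at runtime.
RUN pip install --no-cache-dir deepspeed

# Sanity-check the environment at build time, as the maintainers asked above.
RUN ds_report
```

Building on an image whose NCCL matches the curated environment avoids exactly the version mismatch flagged earlier in this thread.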