Open manchandasahil opened 3 years ago
@manchandasahil, have you seen this tutorial: https://www.deepspeed.ai/tutorials/azure/
Hi @manchandasahil, if Azure ML works for you, we have a short video about DeepSpeed on AML here: https://www.youtube.com/watch?v=yBVXR8G8Bg8&feature=youtu.be
We are also part of:
@awan-10 Thank you so much for the tutorial. I followed it and was able to run DeepSpeed on a single node, but as I increase the number of nodes I get the following error:
[2021-02-18T11:48:40.368929] Writing error with error_code UserError to hosttool error file located at /mnt/batch/tasks/workitems/1e9edaa3-985f-4e6b-895d-025a0c3283eb/job-1/bert-pretraining-ds-_ccbd9647-0f77-4d0b-8e47-6db683e1d7f3/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 1733
Traceback (most recent call last):
File "run_mlm.py", line 661, in
[2021-02-18T11:48:40.444748] Finished context manager injector with Exception.
Thank you so much in advance for the help.
I see. Are you using the DeepSpeed Curated Environment or your own custom-built environment? The error message you have is coming from NCCL 2.8.3, which is not what we have in the curated environment.
Can you please run ds_report on your machine and share its output? I want to see which versions of PyTorch, NCCL, and DeepSpeed are being used here.
cc @jeffra
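For anyone who wants the same three version numbers without the full ds_report output, here is a minimal sketch. It assumes only that PyTorch and DeepSpeed, if present, were installed as regular Python packages. The helper names (`installed_version`, `nccl_version`, `collect_versions`) are made up for illustration and are not part of DeepSpeed.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def nccl_version() -> Optional[str]:
    """Return the NCCL version PyTorch was built with, or None if unavailable."""
    try:
        import torch.cuda.nccl as nccl  # only importable when torch is installed
        raw = nccl.version()
    except Exception:
        # torch missing, or this torch build has no NCCL support
        return None
    # Older torch releases return an int, newer ones a tuple such as (2, 8, 3)
    if isinstance(raw, tuple):
        return ".".join(str(part) for part in raw)
    return str(raw)


def collect_versions() -> Dict[str, Optional[str]]:
    """Gather the versions asked for in this thread."""
    return {
        "torch": installed_version("torch"),
        "deepspeed": installed_version("deepspeed"),
        "nccl": nccl_version(),
    }


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not found'}")
```

Running this inside the same Docker image that the job uses (rather than on a local machine) should show whether the NCCL version differs from the one in the curated environment.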
Hi, thank you for replying. I have copied the ds_report output from my Docker image:
Please ignore the CUDA errors. I ran it on my local system, which does not have GPUs.
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0a0+1606899
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
Also, do you have a recent image with apex, OpenMPI, and everything else needed to run on Azure, or a base image that we can build on top of? Is the Docker repo deepspeed/deepspeed the correct one?
I am trying to schedule a bert-large MLM training job on an Azure compute cluster using DeepSpeed. It seems that DeepSpeed needs SSH access to the VMs, but our data is huge, so we need some guidance on how to launch DeepSpeed on an Azure compute cluster.
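For context on the SSH requirement: DeepSpeed's own multi-node launcher reads a hostfile with one `hostname slots=N` line per node and reaches those nodes over passwordless SSH. The following is a minimal sketch of generating such a file. The node names `worker-1` and `worker-2`, the count of 4 GPUs per node, and the helper name `format_hostfile` are all hypothetical and do not come from this thread.

```python
from typing import Dict


def format_hostfile(slots_per_host: Dict[str, int]) -> str:
    """Render a DeepSpeed-style hostfile: one 'hostname slots=N' line per node."""
    lines = []
    for host, slots in slots_per_host.items():
        if slots < 1:
            raise ValueError(f"{host} must have at least one slot, got {slots}")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-node cluster with 4 GPUs per node
    hostfile_text = format_hostfile({"worker-1": 4, "worker-2": 4})
    with open("hostfile", "w") as handle:
        handle.write(hostfile_text)
    print(hostfile_text, end="")
```

The resulting file would typically be passed to the launcher as `deepspeed --hostfile=hostfile run_mlm.py ...`. Whether this route is practical on an Azure compute cluster depends on whether the nodes can reach each other over SSH, which is the open question in this issue.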