microsoft / msrflute

Federated Learning Utilities and Tools for Experimentation
https://aka.ms/flute
MIT License

RuntimeError: CUDA error: invalid device ordinal and setting up NCCL + requesting subprocess model update for python 3.6+ #20

Closed AymenTlili131 closed 1 year ago

AymenTlili131 commented 1 year ago

Hi there maintainers, first off I'm thankful for the development and engineering work that went into setting up this framework. I tried picking it up and, as a first step toward simulating parallel GPU computing with NCCL, I ran into some issues. Here's the error I'm currently trying to fix.

error [1]

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

My system is Zorin OS 16, which is based on Ubuntu 20.04, and I'm trying to use an NVIDIA RTX 3060 GPU.

nvidia-smi returns the following

| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 31%   27C    P8    14W / 170W |   1426MiB / 12045MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1484      G   /usr/lib/xorg/Xorg                128MiB |
|    0   N/A  N/A      1633      G   /usr/bin/gnome-shell               89MiB |
|    0   N/A  N/A      7155      G   ...548701901119532058,131072       28MiB |
|    0   N/A  N/A     12476      C   ...da3/envs/FLUTE/bin/python     1175MiB |

and nvcc --version returns the following

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

This screenshot shows that my PyTorch environment is almost ready to go.

Screenshot from 2023-02-13 13-29-41

Now, when trying to install NCCL, I can't find a way to confirm whether the installation was successful, or where the NCCL home is.
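The closest I got to a sanity check was querying the NCCL that PyTorch itself reports (just a rough check on my part; as far as I understand, the PyTorch wheel bundles its own NCCL, so this doesn't necessarily reflect the libnccl2 installed through apt):

import torch
import torch.distributed as dist

# Rough check of the NCCL build that PyTorch was compiled against.
# Note: the PyTorch wheel usually bundles its own NCCL, which can differ
# from the system-wide libnccl2 installed through apt.
print("torch version      :", torch.__version__)
print("CUDA available     :", torch.cuda.is_available())
print("visible GPUs       :", torch.cuda.device_count())
print("NCCL backend avail :", dist.is_nccl_available())
print("NCCL version       :", torch.cuda.nccl.version())  # int on older torch, tuple on newer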

Using the command from the README (python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl) yields the following output, with no models being stored in the scratch folder. This is error [1]'s original stack:

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (packaging 22.0 (/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages), Requirement.parse('packaging<22.0,>=20.0')).
The data can be found here:   ./testing
The data can be found here:   ./testing
The data can be found here:   ./testing

Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'batch_size'} in [server_config][val][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'batch_size', 'max_grad_norm'} in [server_config][val][data_config]

Mon Feb 13 12:39:20 2023 : Assigning default values for: {'max_grad_norm', 'num_frames'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames', 'max_grad_norm'} in [server_config][test][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]
Mon Feb 13 12:39:20 2023 : Assigning default values for: {'num_frames'} in [client_config][train][data_config]

Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Mon Feb 13 12:39:20 2023 : Backend: nccl
Added key: store_based_barrier_key:1 to store for rank: 0
Added key: store_based_barrier_key:1 to store for rank: 2

Added key: store_based_barrier_key:1 to store for rank: 1
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 1
Traceback (most recent call last):
  File "e2e_trainer.py", line 238, in <module>
    run_worker(model_path, config, task, data_path, local_rank, backend)
  File "e2e_trainer.py", line 100, in run_worker
    torch.cuda.set_device(device)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 0
Mon Feb 13 12:39:20 2023 : Assigning worker to GPU 2

Preparing model .. Initializing
Traceback (most recent call last):
  File "e2e_trainer.py", line 238, in <module>
    run_worker(model_path, config, task, data_path, local_rank, backend)
  File "e2e_trainer.py", line 100, in run_worker
    torch.cuda.set_device(device)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
GRU(
  (embedding): Embedding()
  (rnn): GRU2(
    (w_ih): Linear(in_features=160, out_features=1536, bias=True)
    (w_hh): Linear(in_features=512, out_features=1536, bias=True)
  )
  (squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:20 2023 : initialize model with default settings
Mon Feb 13 12:39:20 2023 : trying to move the model to GPU
Mon Feb 13 12:39:21 2023 : model: GRU(
  (embedding): Embedding()
  (rnn): GRU2(
    (w_ih): Linear(in_features=160, out_features=1536, bias=True)
    (w_hh): Linear(in_features=512, out_features=1536, bias=True)
  )
  (squeeze): Linear(in_features=512, out_features=160, bias=False)
)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_allocated(): 10909184
/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/cuda/memory.py:397: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  FutureWarning)
Mon Feb 13 12:39:21 2023 : torch.cuda.memory_cached(): 23068672
Mon Feb 13 12:39:21 2023 : torch.cuda.synchronize(): None
Loading json-file:  ./testing/data/nlg_gru/val_data.json
Loading json-file:  ./testing/data/nlg_gru/test_data.json
Loading json-file:  ./testing/data/nlg_gru/train_data.json
Mon Feb 13 12:39:21 2023 : Server data preparation
Mon Feb 13 12:39:21 2023 : No server training set is defined
Mon Feb 13 12:39:21 2023 : Prepared the dataloaders
Mon Feb 13 12:39:21 2023 : Loading Model from: None
Could not load the run context. Logging offline
Attempted to log scalar metric System memory (GB):
15.414344787597656
Attempted to log scalar metric server_config.num_clients_per_iteration:
10
Attempted to log scalar metric server_config.max_iteration:
3
Attempted to log scalar metric dp_config.eps:
0
Attempted to log scalar metric dp_config.max_weight:
0
Attempted to log scalar metric dp_config.min_weight:
0
Attempted to log scalar metric server_config.optimizer_config.type:
adam
Attempted to log scalar metric server_config.optimizer_config.lr:
0.003
Attempted to log scalar metric server_config.optimizer_config.amsgrad:
True
Attempted to log scalar metric server_config.annealing_config.type:
step_lr
Attempted to log scalar metric server_config.annealing_config.step_interval:
epoch
Attempted to log scalar metric server_config.annealing_config.gamma:
1.0
Attempted to log scalar metric server_config.annealing_config.step_size:
100
Mon Feb 13 12:39:21 2023 : Launching server
Mon Feb 13 12:39:21 2023 : server started
Attempted to log scalar metric Max iterations:
3
Attempted to log scalar metric LR for agg. opt.:
0.003
Mon Feb 13 12:39:21 2023 : Running ['val'] at itr=0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12703 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12704) of binary: /home/crns/anaconda3/envs/FLUTE/bin/python
Traceback (most recent call last):
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/crns/anaconda3/envs/FLUTE/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
e2e_trainer.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-13_12:39:24
  host      : crns-IdeaCentre-Gaming5-14IOB6
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 12705)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-13_12:39:24
  host      : crns-IdeaCentre-Gaming5-14IOB6
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12704)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Before that, I also tried running pytest -v -s in ./testing (see the two attached screenshots: Screenshot from 2023-02-13 13-36-31 and Screenshot from 2023-02-13 13-37-04).

So my guess was that I hadn't set up NCCL properly. I tried to find the legacy build compatible with my setup at https://developer.nvidia.com/nccl/nccl-legacy-downloads and got NCCL 2.11.4 for CUDA 11.4 (September 7, 2021).

As instructed, I used "sudo apt install libnccl2=2.11.4-1+cuda11.4 libnccl-dev=2.11.4-1+cuda11.4", which went smoothly, but I still encountered the same stack trace as before.

Going to NVIDIA's NCCL tests repo, I skipped the installation steps (since I have an official release), ran "make", and then "./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1" (I also tried changing the -g argument to 4, or keeping ngpus), and got the same error either way: ./build/all_reduce_perf: symbol lookup error: ./build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum

That's where I stopped, with those two issues; I feel that solving one would help with the other.

Before I got this far I had to reformat the workstation a couple of times, since NVIDIA fails to keep all the necessary compatibility information in one place, but this post saved me in my previous environments. I managed to get FLUTE running on gloo, and while I still had a similar warning stack trace, models could be saved.

In this fresh environment I also had trouble importing and using the Python built-in subprocess module, specifically because the "run" method generated errors that I worked around with this: https://stackoverflow.com/questions/40590192/getting-an-error-attributeerror-module-object-has-no-attribute-run-while. Even with that solution I was still receiving an error, because "text" raised a TypeError and couldn't be passed to the Popen class constructor: Failed: TypeError: __init__() got an unexpected keyword argument 'text'

My investigation led me to the fact that the text argument was only added in Python 3.7, while your readme.md suggests 3.8, hence the problem. I can understand this if you have been working on the project for a long time, and it could have been a separate issue (one you can label as an enhancement) because it causes the tests in pytest -v -s to fail, but I felt it could be related to why the processes aren't being assigned to the virtual GPUs properly.
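For reference, the workaround I settled on (just a sketch, the helper name is my own) was to fall back to universal_newlines, which is the pre-3.7 spelling of the same behaviour:

import subprocess
import sys

def run_text(cmd, **kwargs):
    # The 'text' keyword was only introduced in Python 3.7; on older
    # interpreters the equivalent (decode stdout/stderr to str) is
    # 'universal_newlines=True'.
    if sys.version_info >= (3, 7):
        return subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, text=True, **kwargs)
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True, **kwargs)

result = run_text(["nvidia-smi"])
print(result.stdout)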

Other honorable mentions: requirements.txt should use scikit-learn instead of the deprecated sklearn package name, and using the newest PyTorch 1.13 (compatible with CUDA 11.7) leaves the speech recognition task with a deprecated torchaudio.

Apologies if I mentioned several irrelevant steps or issues, but I hope I can get an exact answer to error [1]'s stack trace and quickly get back to focusing on the experimentation side of my research. Thanks to the msrflute team, and I hope to hear from you soon.

Mirian-Hipolito commented 1 year ago

Hello @AymenTlili131,

We really appreciate your feedback. I was able to reproduce the same error on my end, and it seems to me that this is not a matter of NCCL setup but of the number of GPUs you're trying to assign. When running on NCCL, torch distributed receives the argument --nproc_per_node as the number of GPUs you have available in your system to run the simulation; however, FLUTE requires at least 2 in order to launch: 1 Server and 1 Worker that can execute many clients. I can see you only have 1 available (GPU 0).
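To illustrate what happens internally (a simplified sketch, not the exact FLUTE code): torch.distributed.run spawns --nproc_per_node processes and gives each one a LOCAL_RANK, and each worker then pins itself to the GPU with that same index, so any rank beyond the number of visible GPUs fails with exactly the "invalid device ordinal" error you are seeing:

import os
import torch

# Simplified view of the per-process device assignment.
# torch.distributed.run exports LOCAL_RANK for every process it spawns
# (0, 1 and 2 for --nproc_per_node=3).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
available = torch.cuda.device_count()  # 1 on a single RTX 3060

if local_rank >= available:
    # Ranks 1 and 2 would try to select GPU 1 and GPU 2, which do not exist,
    # which is what torch.cuda.set_device reports as "invalid device ordinal".
    raise RuntimeError(f"local_rank {local_rank} has no matching GPU "
                       f"(only {available} visible)")

torch.cuda.set_device(local_rank)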


This is the stack trace; as you can see, the problem occurs at assignment time.

I also took a look at the NCCL tests repo and noticed that the -g argument corresponds to the number of available GPUs. This is the reason for the failure: given that you only have 1 available, it's not able to run with a higher number.

You can find more information about the FLUTE architecture here. There is already one issue open for this situation: https://github.com/microsoft/msrflute/issues/15. We apologize for the inconvenience at this moment.

Regarding the comments about the requirements/Python versions, we will make sure to update them in the next commit.

Let me know if this information is useful or if we can provide more support on this. 🙂

Thanks, Mirian

AymenTlili131 commented 1 year ago

Thanks for writing back so soon. I'll request access to a workstation with 2 or more GPUs and test it for myself, but this is a solid explanation of why the error was raised, thanks! I'd still like to keep the issue open until I confirm that it indeed works (no more than a week). From reading the linked FLUTE architecture it should work, and hopefully I won't take long with the environment setup and testing before I get back to you.

Mirian-Hipolito commented 1 year ago

Thanks @AymenTlili131! Let us know if this issue persists.

Regards, Mirian.

AymenTlili131 commented 1 year ago

Hey @Mirian-Hipolito, things are up and running on my end. I'm grateful for your explanation and support, and I hope you and the maintainers have a wonderful rest of the week. I'll make sure to cite the FLUTE team if I find anything useful! Thanks again. Kind regards

Mirian-Hipolito commented 1 year ago

Hello @AymenTlili131, we are happy to share that we have removed the restriction on the minimum number of GPUs required to run FLUTE in our latest release. For documentation on how to run an experiment using a single GPU, please refer to the README.

AymenTlili131 commented 1 year ago

Hey Mirian, this is great news. I gained access to other GPUs in the meantime and experimented with working on them remotely, but thanks to your efforts and your colleagues' I can now experiment with tweaks and proofs of ideas at a much smaller scale. Greatly appreciated, and thanks to the entire Microsoft family.