pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

local_docker does not add utility nvidia libraries to containers #906

Closed: clumsy closed this issue 3 weeks ago

clumsy commented 3 weeks ago

🐛 Bug

NVIDIA Docker images require utility driver libraries such as libnvidia-ml, which belong to the `utility` driver capability.

TorchX currently requests only the `compute` capability in the local_docker scheduler.

I verified two solutions for this issue and am not sure which one is better:

NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
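For reference, here is a minimal sketch (my own illustration, not the merged fix) of how both driver capabilities can be requested through the Docker Python SDK; the image name and the integration point in local_docker are assumptions:

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Option A (hypothetical): request both driver capabilities explicitly,
# mirroring what a fixed local_docker device request could look like.
device_requests = [DeviceRequest(count=-1, capabilities=[["compute", "utility"]])]

# Option B (hypothetical): also set NVIDIA_DRIVER_CAPABILITIES so the NVIDIA
# container runtime mounts the utility libraries (libnvidia-ml, nvidia-smi).
environment = {"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"}

logs = client.containers.run(
    "nvcr.io/nvidia/pytorch:24.03-py3",  # placeholder image, not from the issue
    command="nvidia-smi",  # needs libnvidia-ml.so.1 from the utility capability
    device_requests=device_requests,
    environment=environment,
    remove=True,
)
print(logs.decode())
```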

Module (check all that apply):

To Reproduce

Steps to reproduce the behavior:


This results in a crash with:

train/0 [0]:[rank: 0] Global seed set to 10
train/0 [0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:distributed_backend=nccl
train/0 [0]:All distributed processes registered. Starting with 1 processes
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:
train/0 [0]:Error executing job with overrides: ['++trainer.max_steps=10']
train/0 [0]:Traceback (most recent call last):
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
train/0 [0]:    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
train/0 [0]:    return function(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
train/0 [0]:    self._run(model, ckpt_path=ckpt_path)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
train/0 [0]:    self.__setup_profiler()
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
train/0 [0]:    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
train/0 [0]:    dirpath = self.strategy.broadcast(dirpath)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
train/0 [0]:    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]:    return func(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2597, in broadcast_object_list
train/0 [0]:    broadcast(object_sizes_tensor, src=src, group=group)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]:    return func(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1900, in broadcast
train/0 [0]:    work = default_pg.broadcast([tensor], opts)
train/0 [0]:torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
train/0 [0]:ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
train/0 [0]:Last error:
train/0 [0]:Failed to open libnvidia-ml.so.1

Expected behavior

libnvidia-ml and the other utility libraries should be added to the container.
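A quick way to confirm whether the utility libraries made it into a running container is to try loading libnvidia-ml directly (a minimal check I'm adding for illustration, not part of the original report):

```python
# Sketch: run inside the container to verify that the utility capability
# (which provides libnvidia-ml.so.1) was mounted by the NVIDIA runtime.
import ctypes

try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 loaded: utility libraries are present")
except OSError as exc:
    print(f"libnvidia-ml.so.1 missing: {exc}")
```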

Environment

PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.26

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.201-191.748.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2699.588
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.02
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
...

Additional context

clumsy commented 3 weeks ago

Please advise @kiukchung, @d4l3k

clumsy commented 3 weeks ago

Looks like this issue can be closed now that the fix was merged, @andywag. I still wonder if we can remove device_request from local_docker entirely and let it default to compute,utility.
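For context, a minimal sketch (an assumption on my part, not the merged change) of what that simpler default could look like with the Docker SDK: request GPUs with the generic gpu capability and let NVIDIA_DRIVER_CAPABILITIES fall back to the toolkit's documented default of compute,utility:

```python
from docker.types import DeviceRequest

# Hypothetical simplification: request GPUs generically rather than pinning
# driver capabilities to ["compute"]. When NVIDIA_DRIVER_CAPABILITIES is not
# set, nvidia-container-toolkit defaults it to "compute,utility", so
# libnvidia-ml.so.1 would be mounted into the container.
default_gpu_request = DeviceRequest(count=-1, capabilities=[["gpu"]])
```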