pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

Use GPU with `local_docker` #648

Closed vwxyzjn closed 2 years ago

vwxyzjn commented 2 years ago

🐛 Bug

Can't use GPU with the local_docker scheduler.

To Reproduce

Steps to reproduce the behavior:

  1. Create a `test.py` with:

     ```python
     import torch
     print("torch.cuda.is_available():", torch.cuda.is_available())
     ```

  2. Create a `Dockerfile`:

     ```dockerfile
     FROM ghcr.io/pytorch/torchx:0.3.0
     COPY test.py test.py
     ```

  3. Run the following commands:

     ```shell
     docker build -t test:latest .
     docker run --gpus all test:latest python test.py
     torchx run --scheduler local_cwd utils.python --script test.py
     torchx run --scheduler local_docker utils.python --script test.py
     ```
Output:

```text
Sending build context to Docker daemon  6.144kB
Step 1/2 : FROM ghcr.io/pytorch/torchx:0.3.0
 ---> 343f0f3b1a07
Step 2/2 : COPY test.py test.py
 ---> Using cache
 ---> fa75170948b2
Successfully built fa75170948b2
Successfully tagged test:latest
torch.cuda.is_available(): True
torchx 2022-11-05 13:29:02 INFO     loaded configs from /home/costa/Documents/go/src/github.com/vwxyzjn/test/y/torchx_test/.torchxconfig
torchx 2022-11-05 13:29:02 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2022-11-05 13:29:02 INFO     Log directory is: /tmp/torchx_6_h698gw
local_cwd://torchx/torchx_utils_python-mfc1scwb7dncd
torchx 2022-11-05 13:29:02 INFO     Waiting for the app to finish...
python/0 torch.cuda.is_available(): True
torchx 2022-11-05 13:29:04 INFO     Job finished: SUCCEEDED
torchx 2022-11-05 13:29:05 WARNING  `gpus = all` was declared in the [local_docker] section  of the config file but is not a runopt of `local_docker` scheduler. Remove the entry from the config file to no longer see this warning
torchx 2022-11-05 13:29:05 INFO     loaded configs from /home/costa/Documents/go/src/github.com/vwxyzjn/test/y/torchx_test/.torchxconfig
torchx 2022-11-05 13:29:05 INFO     Checking for changes in workspace `file:///home/costa/Documents/go/src/github.com/vwxyzjn/test/y/torchx_test`...
torchx 2022-11-05 13:29:05 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-11-05 13:29:06 INFO     Built new image `sha256:32cf796cecfd488d7e0e5ba5069e9218098bed75597b3b402b9c557a796e5f4a` based on original image `ghcr.io/pytorch/torchx:0.3.0` and changes in workspace `file:///home/costa/Documents/go/src/github.com/vwxyzjn/test/y/torchx_test` for role[0]=python.
local_docker://torchx/torchx_utils_python-bq7cx57f1c6wr
torchx 2022-11-05 13:29:06 INFO     Waiting for the app to finish...
python/0 torch.cuda.is_available(): False
torchx 2022-11-05 13:29:07 INFO     Job finished: SUCCEEDED
```

Expected behavior

Notice that torch detects the GPU when running `poetry run torchx run --scheduler local_cwd utils.python --script test.py`, but it fails to do so when running `poetry run torchx run --scheduler local_docker utils.python --script test.py`. The GPU is also recognized when running `docker run --gpus all test:latest python test.py` directly.

Environment

```text
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Pop!_OS 21.10 (x86_64)
GCC version: (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.34

Python version: 3.9.5 (default, Jul 19 2021, 13:27:26)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.17.5-76051705-generic-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3060 Ti
GPU 1: NVIDIA GeForce RTX 3060 Ti

Nvidia driver version: 470.103.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] botorch==0.6.0
[pip3] gpytorch==1.9.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.4
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.13.0
[pip3] torch-model-archiver==0.6.0
[pip3] torchmetrics==0.10.2
[pip3] torchserve==0.6.0
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0
[pip3] torchx==0.3.0
[conda] Could not collect

Versions of CLIs:
AWS CLI: N/A
gCloud CLI: None
AZ CLI: None
Slurm: N/A
Docker: 20.10.12, build e91ed57
kubectl: None

torchx dev package versions:
aiobotocore:2.1.0
black:22.3.0
boto3:1.20.24
botorch:0.6.0
captum:0.5.0
flake8:3.9.0
gpytorch:1.9.0
hydra-core:1.2.0
ipython:8.6.0
kfp:1.8.9
kfp-pipeline-spec:0.1.16
kfp-server-api:1.8.5
moto:3.0.2
Pygments:2.13.0
pyre-extensions:0.0.21
pytest:7.2.0
pytorch-lightning:1.5.10
requests:2.28.1
requests-oauthlib:1.3.1
requests-toolbelt:0.10.1
strip-hints:0.1.10
torch:1.13.0
torch-model-archiver:0.6.0
torchmetrics:0.10.2
torchserve:0.6.0
torchtext:0.14.0
torchvision:0.14.0
torchx:0.3.0
traitlets:5.5.0
ts:0.5.1
usort:1.0.2
```

torchx config:

```ini
[local_cwd]

[local_docker]
gpus = all
```


d4l3k commented 2 years ago

Hi, we don't add GPUs to the Docker container unless the job requests them, and `utils.python` defaults to 0 GPUs. Can you try passing `--gpu 2` to the `utils.python` component?

https://github.com/pytorch/torchx/blob/main/torchx/components/utils.py#L134

https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L292-L302
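For illustration, the gating described above can be sketched like this. This is a hedged approximation, not the code in the linked `docker_scheduler.py`; the `container_kwargs` helper and the plain dict standing in for docker-py's `DeviceRequest` are hypothetical names invented for this sketch:

```python
# Sketch (assumption, not TorchX's actual implementation): a Docker-based
# scheduler attaches GPUs to the container only when the role requests them,
# which is why utils.python's default of 0 GPUs yields a CPU-only container.

def container_kwargs(requested_gpus: int) -> dict:
    """Build docker-py-style container kwargs for a role."""
    kwargs = {
        "image": "test:latest",
        "command": ["python", "test.py"],
    }
    if requested_gpus > 0:
        # Roughly what `docker run --gpus <n>` does; docker-py models this
        # as a DeviceRequest with capabilities [["gpu"]], shown as a dict.
        kwargs["device_requests"] = [
            {"count": requested_gpus, "capabilities": [["gpu"]]}
        ]
    return kwargs

print("device_requests" in container_kwargs(0))  # False: no GPUs requested
print("device_requests" in container_kwargs(2))  # True: e.g. via --gpu 2
```

With 0 requested GPUs no device request is emitted at all, which matches the `torch.cuda.is_available(): False` result in the report; requesting GPUs on the component adds them back.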

vwxyzjn commented 2 years ago

It works! Thank you!