Closed: thundergolfer closed this issue 3 months ago.
The reproduction program is almost identical to the one in https://github.com/google/gvisor/issues/9827, which is why I revisited that issue's test.
This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")
ValueError:
Format specifier missing precision
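For context, the `ValueError` above is just a bug in the repro script's final print: the format spec `:2.f` has no precision digit after the dot. A minimal sketch of the corrected line (the later run prints `Training duration (seconds): 72.35`, consistent with `:.2f`):

```python
import time

start = time.time()
# ... trainer.fit(...) runs here ...
elapsed = time.time() - start
# ":.2f" puts the precision after the dot; the original ":2.f" raises
# "ValueError: Format specifier missing precision".
print(f"Training duration (seconds): {elapsed:.2f}")
```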
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May 9 15:27:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 49W / 400W | 4MiB / 40960MiB | 27% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Please note that I used `--shm-size=128g` as per https://github.com/google/gvisor/issues/9827#issuecomment-1877649009. I also don't see any `nvproxy: unknown ...` lines in the logs. So maybe you are using a different driver version? Or maybe something to do with the Oracle Cloud environment?
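One quick way to check whether a given environment hits those lines is to scan the runsc debug logs for the `nvproxy: unknown` prefix. A minimal sketch, assuming the logs are plain-text files in a directory written to by runsc's `--debug-log` flag (the default path below is only an example):

```python
#!/usr/bin/env python3
"""Grep runsc debug logs for "nvproxy: unknown" lines (e.g. unknown control commands)."""
import pathlib
import sys

# Example path only; point this at the directory your runsc --debug-log flag writes to.
log_dir = pathlib.Path(sys.argv[1] if len(sys.argv) > 1 else "/tmp/runsc")

for log_file in sorted(p for p in log_dir.rglob("*") if p.is_file()):
    for line_no, line in enumerate(log_file.read_text(errors="replace").splitlines(), start=1):
        if "nvproxy: unknown" in line:
            print(f"{log_file}:{line_no}: {line.strip()}")
```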
`--shm-size` is also set very large; on Oracle workers it's around 1657GB. We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!
On H100 worker:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:04:00.0 Off | 0 |
| N/A 36C P0 113W / 700W | 72459MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:05:00.0 Off | 0 |
| N/A 34C P0 117W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 114W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:0B:00.0 Off | 0 |
| N/A 33C P0 111W / 700W | 72587MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:84:00.0 Off | 0 |
| N/A 60C P0 578W / 700W | 71533MiB / 81559MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:85:00.0 Off | 0 |
| N/A 34C P0 112W / 700W | 841MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 114W / 700W | 16463MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:8B:00.0 Off | 0 |
| N/A 34C P0 111W / 700W | 2405MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 759790 C /opt/conda/bin/python3.10 72446MiB |
We use the same driver version across all GPU workers.
Updated the driver version and still cannot repro the failure on my GCE VM:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
Training duration (seconds): 72.35
Surprisingly, this workload gets stuck without gVisor. I will add `NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD` to nvproxy though; hopefully it resolves whatever failure you are seeing.
> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in https://github.com/google/gvisor/issues/9827, where the test got stuck on `runc`.

The program doesn't get stuck on `runc` in Modal. It completes in around 60s, so a 72.35-second completion under gVisor lines up with that.

> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though, hopefully it resolves whatever failure you are seeing.
🙏
@thundergolfer Let me know if https://github.com/google/gvisor/commit/e9b3218681cdfac0989e95b27642e4aec67d0ea6 fixes the issue. If so, please close this.
Are you still hitting this issue?
No, we're not; happy to have it closed 👍
Description
Doing multi-GPU training on A100s and seeing that on gVisor it gets stuck. Tried the below program on the following GPUs within Modal:
Both the H100 and A100 run into these unknown control commands:
Which is `NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD` -> https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
Steps to reproduce
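The repro program itself isn't reproduced above, so here is a hedged sketch of a workload in the same shape, reconstructed from the log output earlier in the thread (CIFAR-100, a pretrained torchvision ResNet-50 held in a `module` attribute of a `lightning.pytorch` module, a `val_dataloader` with no `validation_step`, one epoch, and a final timing print). The batch size, transforms, optimizer, and device settings are assumptions, not the original repro.py:

```python
# Hypothetical reconstruction for illustration; not the original repro.py.
import time

import lightning.pytorch as pl
import psutil
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T


class ResNetClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Pretrained ResNet-50 re-headed for CIFAR-100's 100 classes
        # (~23.7M parameters, matching the model summary in the logs).
        self.module = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        self.module.fc = torch.nn.Linear(self.module.fc.in_features, 100)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.module(x), y)

    # No validation_step, so Lightning warns and skips the val loop.

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01, momentum=0.9)


def main():
    print("Hello from inside container.")
    proc = psutil.Process()
    print(f"Processes: current_process={proc} parent_process={proc.parent()}")

    transform = T.ToTensor()  # placeholder transform
    train = torchvision.datasets.CIFAR100("data", train=True, download=True, transform=transform)
    val = torchvision.datasets.CIFAR100("data", train=False, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True, num_workers=4)
    val_loader = torch.utils.data.DataLoader(val, batch_size=128, num_workers=4)

    # devices=1 matches the single-GPU runs above; the issue itself is about multi-GPU training.
    trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=1, strategy="ddp")
    start = time.time()
    trainer.fit(ResNetClassifier(), train_loader, val_loader)
    print(f"Training duration (seconds): {time.time() - start:.2f}")


if __name__ == "__main__":
    main()
```

Run it the same way as the sessions above, e.g. `docker run --runtime=runsc --shm-size=128g --gpus=all --rm <image>`.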
runsc version
docker version (if using docker)
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)