thundergolfer opened 10 months ago
Let me know if https://github.com/google/gvisor/pull/9828 solves this issue. Do you know if this will repro on a T4 as well? It's hard to get hold of an A100, but I will try again tomorrow.
I tried running the above-mentioned Docker image on an A100 with runsc; it segfaults and crashes with a different error:
Traceback (most recent call last):
File "repro.py", line 65, in <module>
trainer.fit(model, train_dl, val_dl)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/launchers/multiprocessing.py", line 141, in launch
while not process_context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 145, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
The boot logs show that a different set of ioctls is not implemented:
$ cat /tmp/logs/runsc.log.20231220-181102.011243.boot.txt | grep "nvproxy: unknown"
W1220 18:11:35.603862 18656 frontend.go:683] [ 15: 15] nvproxy: unknown allocation class 0x0000cb33
W1220 18:11:35.616556 18656 frontend.go:521] [ 15: 15] nvproxy: unknown control command 0x2080182b (paramsSize=20)
What Nvidia driver version is being used at Modal? I was testing on 525.105.17.
The 0x0000cb33 allocation class is NV_CONFIDENTIAL_COMPUTE, which was only added in 535.43.02.
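For reference, a quick way to confirm which driver a container actually sees (a sketch; it assumes the driver's procfs node is exposed inside the sandbox, otherwise nvidia-smi reports the same information):

# Sketch: print the NVIDIA kernel driver version visible inside the container.
try:
    with open("/proc/driver/nvidia/version") as f:
        print(f.read().strip())
except FileNotFoundError:
    print("driver procfs node not visible in this environment")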
So I tested again with the 535.54.03 driver, and the above-mentioned unknown ioctls disappeared. However, the segfault persists. This does not happen with runc. I can reproduce this segfault on a T4 GPU too.
Here are the relevant logs from the segfault (it looks like a null pointer dereference):
RIP = 0x7eb1f7de9ddb
VMAs:
...
7eb1f7d68000-7eb1f7d8a000 r--p 00000000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7eb1f7d8a000-7eb1f7f02000 r-xp 00022000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
So we need to look at offset 0x00022000 + (0x7eb1f7de9ddb - 0x7eb1f7d8a000) = 0x81ddb in /usr/lib/x86_64-linux-gnu/libc-2.31.so.
Using objdump -d:
81dbd: 48 89 44 24 28 mov %rax,0x28(%rsp)
81dc2: 8b 44 24 24 mov 0x24(%rsp),%eax
81dc6: 48 8d 35 1d 42 11 00 lea 0x11421d(%rip),%rsi # 195fea <_libc_intl_domainname@@GLIBC_2.2.5+0x1012>
81dcd: 4c 89 ef mov %r13,%rdi
81dd0: 89 c2 mov %eax,%edx
81dd2: 83 c0 01 add $0x1,%eax
81dd5: 89 44 24 24 mov %eax,0x24(%rsp)
81dd9: 31 c0 xor %eax,%eax
81ddb: e8 50 1e fd ff callq 53c30 <fprintf@@GLIBC_2.2.5> <----- HERE
81de0: 64 8b 04 25 18 00 00 mov %fs:0x18,%eax
81de7: 00
@nixprime pointed out that I was looking at the objdump of the wrong file. He figured out the actual faulting instruction:
# objdump -d /usr/lib/x86_64-linux-gnu/libc-2.31.so | less
...
0000000000081dd0 <_IO_fclose@@GLIBC_2.2.5>:
81dd0: f3 0f 1e fa endbr64
81dd4: 41 54 push %r12
81dd6: 55 push %rbp
81dd7: 48 89 fd mov %rdi,%rbp
81dda: 53 push %rbx
81ddb: 8b 07 mov (%rdi),%eax <--- here
Per the logs, %rdi = 0, so the segfault lines up. He found out that if libnccl.so.2:getHostHash() fails to open /proc/sys/kernel/random/boot_id, it calls fclose(NULL) and takes a SIGSEGV.
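A quick way to check from inside the sandbox whether that file is present (a sketch; the path is the one NCCL's getHostHash() reads):

# Sketch: verify that /proc/sys/kernel/random/boot_id exists and is readable.
from pathlib import Path

boot_id = Path("/proc/sys/kernel/random/boot_id")
if boot_id.exists():
    print("boot_id:", boot_id.read_text().strip())
else:
    print("boot_id missing; this is the condition that led NCCL to fclose(NULL)")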
https://github.com/google/gvisor/pull/9833 fixes this issue, but now we get a different exception:
Epoch 0: 0%| | 0/391 [00:00<?, ?it/s] ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "repro.py", line 65, in <module>
trainer.fit(model, train_dl, val_dl)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/launchers/multiprocessing.py", line 141, in launch
while not process_context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 36) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The logs also show an unimplemented control command (NV2080_CTRL_CMD_NVLINK_GET_NVLINK_CAPS). I added support for it in https://github.com/google/gvisor/pull/9835, but it did not fix the issue.
The logs also show four user faults happening at the same instruction:
D1221 15:38:39.022730 228800 task_run.go:312] [ 60: 60] Unhandled user fault: addr=7ea232582000 ip=7ea3c2f95963 access=r-- sig=11 err=BusError: no space left on device
...
VMAs:
...
7ea3c2e0a000-7ea3c2e2c000 r--p 00000000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ea3c2e2c000-7ea3c2fa4000 r-xp 00022000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
So we need to look at offset (0x7ea3c2f95963 - 0x7ea3c2e2c000) + 0x22000 = 0x18b963 in /usr/lib/x86_64-linux-gnu/libc-2.31.so.
root@2b2350fada82:/# objdump -d /usr/lib/x86_64-linux-gnu/libc-2.31.so | less
...
18b927: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
18b92b: c5 fe 6f 4c 16 e0 vmovdqu -0x20(%rsi,%rdx,1),%ymm1
18b931: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
18b935: c5 fe 7f 4c 17 e0 vmovdqu %ymm1,-0x20(%rdi,%rdx,1)
18b93b: c5 f8 77 vzeroupper
18b93e: c3 retq
18b93f: 48 3b 15 72 5b 06 00 cmp 0x65b72(%rip),%rdx # 1f14b8 <mallwatch@@GLIBC_2.2.5+0x8>
18b946: 0f 83 25 01 00 00 jae 18ba71 <__nss_database_lookup@GLIBC_2.2.5+0x283f1>
18b94c: 48 39 f7 cmp %rsi,%rdi
18b94f: 72 0f jb 18b960 <__nss_database_lookup@GLIBC_2.2.5+0x282e0>
18b951: 74 12 je 18b965 <__nss_database_lookup@GLIBC_2.2.5+0x282e5>
18b953: 4c 8d 0c 16 lea (%rsi,%rdx,1),%r9
18b957: 4c 39 cf cmp %r9,%rdi
18b95a: 0f 82 c5 01 00 00 jb 18bb25 <__nss_database_lookup@GLIBC_2.2.5+0x284a5>
18b960: 48 89 d1 mov %rdx,%rcx
18b963: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) <--- here
Our driver version is still 525.60.13.
Sorry I couldn't get a multi-GPU A100 VM to test the repro myself. Looks like it's thrown up a lot of things! Internally we populate boot_id, so it makes sense that I didn't see that crash when running the Modal script.
Do you know if the A100 GPU has 40GB memory or 80GB?
So the BusError: no space left on device error on page fault in gVisor happens because we are hitting the tmpfs size limit for /dev/shm. The OCI spec shows that /dev/shm has a limit of 67108864 bytes (64 MiB):
...
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/moby/bfb43b3231cbb1f7e60e78f81eab2f25d80f48e802cb5d29185979785746b814/shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=67108864"
]
},
...
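For what it's worth, the limit can be confirmed from inside the container with statvfs (a sketch; with the mount options above, the total should come out to 67108864 bytes):

# Sketch: report the size and free space of the /dev/shm tmpfs.
import os

st = os.statvfs("/dev/shm")
total = st.f_blocks * st.f_frsize
free = st.f_bavail * st.f_frsize
print(f"/dev/shm total={total} bytes ({total / 2**20:.0f} MiB), free={free} bytes")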
Some investigation showed that we were over-allocating in tmpfs on page faults. See: https://github.com/google/gvisor/blob/149350e5c428ae9b4ca8f2e2b7960071708113dd/pkg/sentry/fsimpl/tmpfs/regular_file.go#L300-L310
We were trying to allocate the optional range if possible. Often, the optional range was much bigger than the required range. MemoryManager.getPMAsInternalLocked() passes a larger optional range when the PMA gap is larger than the requested range length. Allocating the entire PMA gap can prevent future page faults, but in this case it causes us to hit the tmpfs size limit very quickly.
So I updated tmpfs to only allocate the required range on page faults in #9839. It may have helped, but we are still hitting the same error. We need to investigate whether tmpfs is over-allocating somewhere else, not releasing pages, or has a page accounting bug.
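As a toy illustration of the effect (the numbers here are made up for illustration; this is not gVisor's Go code): charging the whole optional range on every fault exhausts a 64 MiB tmpfs after a handful of faults, while charging only the required pages does not.

# Toy numbers only.
SHM_LIMIT = 64 * 2**20     # tmpfs size from the OCI spec above
PAGE = 4096                # a single faulting page ("required" range)
OPTIONAL = 16 * 2**20      # hypothetical PMA-gap-sized "optional" range

print("faults until full, allocating optional:", SHM_LIMIT // OPTIONAL)  # 4
print("faults until full, allocating required:", SHM_LIMIT // PAGE)      # 16384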
Strace logs show an interesting pattern:
I1223 19:14:31.881425 937434 strace.go:564] [ 15: 15] python3 E lstat(0x7ef4b30c5190 /dev/shm/ZVh4Mj, 0x7ef4b30c50a0)
I1223 19:14:31.881442 937434 strace.go:602] [ 15: 15] python3 X lstat(0x7ef4b30c5190 /dev/shm/ZVh4Mj, 0x7ef4b30c50a0) = 0 (0x0) errno=2 (no such file or directory) (4.447µs)
I1223 19:14:31.881462 937434 strace.go:570] [ 15: 15] python3 E openat(AT_FDCWD /, 0x7ef4b30c5190 /dev/shm/ZVh4Mj, O_RDWR|O_CREAT|O_EXCL, 0o600)
I1223 19:14:31.881486 937434 strace.go:608] [ 15: 15] python3 X openat(AT_FDCWD /, 0x7ef4b30c5190 /dev/shm/ZVh4Mj, O_RDWR|O_CREAT|O_EXCL, 0o600) = 73 (0x49) (11.537µs)
I1223 19:14:31.881529 937434 strace.go:567] [ 15: 15] python3 E write(0x49 /dev/shm/ZVh4Mj, 0x7ef4b30c5240 "\xff\xff\xff\x7f\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 0x20)
I1223 19:14:31.881556 937434 strace.go:605] [ 15: 15] python3 X write(0x49 /dev/shm/ZVh4Mj, ..., 0x20) = 32 (0x20) (14.519µs)
I1223 19:14:31.881587 937434 strace.go:576] [ 15: 15] python3 E mmap(0x0, 0x20, PROT_READ|PROT_WRITE, MAP_SHARED, 0x49 /dev/shm/ZVh4Mj, 0x0)
I1223 19:14:31.881609 937434 strace.go:614] [ 15: 15] python3 X mmap(0x0, 0x20, PROT_READ|PROT_WRITE, MAP_SHARED, 0x49 /dev/shm/ZVh4Mj, 0x0) = 139497752580096 (0x7edf59fd5000) (9.119µs)
I1223 19:14:31.881649 937434 strace.go:564] [ 15: 15] python3 E link(0x7ef4b30c5190 /dev/shm/ZVh4Mj, 0x7ef4b30c51b0 /dev/shm/sem.mp-uq5k_1mc)
I1223 19:14:31.881670 937434 strace.go:602] [ 15: 15] python3 X link(0x7ef4b30c5190 /dev/shm/ZVh4Mj, 0x7ef4b30c51b0 /dev/shm/sem.mp-uq5k_1mc) = 0 (0x0) (9.127µs)
I1223 19:14:31.881685 937434 strace.go:564] [ 15: 15] python3 E fstat(0x49 /dev/shm/ZVh4Mj, 0x7ef4b30c50b0)
I1223 19:14:31.881708 937434 strace.go:602] [ 15: 15] python3 X fstat(0x49 /dev/shm/ZVh4Mj, 0x7ef4b30c50b0 {dev=24, ino=27, mode=S_IFREG|0o600, nlink=2, uid=0, gid=0, rdev=0, size=32, blksize=4096, blocks=1, atime=2023-12-23 19:14:31.88147723 +0000 UTC, mtime=2023-12-23 19:14:31.881553145 +0000 UTC, ctime=2023-12-23 19:14:31.881553145 +0000 UTC}) = 0 (0x0) (2.177µs)
I1223 19:14:31.882240 937434 strace.go:561] [ 15: 15] python3 E unlink(0x7ef4b30c5190 /dev/shm/ZVh4Mj)
I1223 19:14:31.882362 937434 strace.go:599] [ 15: 15] python3 X unlink(0x7ef4b30c5190 /dev/shm/ZVh4Mj) = 0 (0x0) (12.97µs)
I1223 19:14:31.882448 937434 strace.go:561] [ 15: 15] python3 E close(0x49 /dev/shm/ZVh4Mj (deleted))
I1223 19:14:31.882459 937434 strace.go:599] [ 15: 15] python3 X close(0x49 /dev/shm/ZVh4Mj (deleted)) = 0 (0x0) (2.197µs)
I1223 19:14:31.882599 937434 strace.go:561] [ 15: 15] python3 E unlink(0x7ef4b30c5270 /dev/shm/sem.mp-uq5k_1mc)
I1223 19:14:31.882633 937434 strace.go:599] [ 15: 15] python3 X unlink(0x7ef4b30c5270 /dev/shm/sem.mp-uq5k_1mc) = 0 (0x0) (13.497µs)
These files are created in /dev/shm, written to, mmap(2)-ed, hard-linked, and then deleted (both the original path and the link path). The pages stay referenced from the mmap(2)-ed VMA. In fact, there are 94 such deleted /dev/shm files that still have an active VMA due to mmap(2). How does the application plan on munmap(2)-ing these VMAs?
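For context, this is the pattern glibc's sem_open() produces when Python's multiprocessing creates a POSIX semaphore; a minimal sketch (the sem.mp-* name matches the strace above, and the munmap behavior described in the comments is glibc/CPython behavior, not something shown in these logs):

# Sketch: multiprocessing.Semaphore is backed by a POSIX semaphore. glibc's
# sem_open() creates a temporary file in /dev/shm, writes the 32-byte sem_t,
# mmap(2)s it MAP_SHARED, hard-links it to /dev/shm/sem.mp-*, and then both
# names get unlinked. The mapping is only munmap(2)-ed when the semaphore
# object is finalized (sem_close() at garbage collection or interpreter exit).
import multiprocessing as mp

if __name__ == "__main__":
    sem = mp.Semaphore(1)   # sem_open() -> deleted-but-still-mapped /dev/shm file
    with sem:
        pass
    del sem                 # finalizer calls sem_close(), releasing the mapping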
Hey, just got back from PTO. Thanks again for your investigation. Looks like I should incorporate the related fixes into our runtime and see what I get with the repro program.
I added a modified version of the Dockerfile above to https://github.com/google/gvisor/blob/e6a42ae59450fe561cb37eb23ca14e5202067637/images/gpu/pytorch/, but it is currently disabled since the bug isn't fixed yet, and because the Python script runs forever even when it does work (with runc).
Can you assist with making it run for just a small amount of time under runc? I tried reducing the model size and using fast_dev_run=True, but neither seems to have any effect. If we can limit it to a few seconds of test time under runc, we can turn it into a permanent regression test for gVisor.
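One possibility, as an untested sketch (these are standard lightning.pytorch Trainer options; model, train_dl, and val_dl are the names from the repro script):

# Untested sketch: cap the repro to a handful of batches so it finishes in seconds.
import lightning.pytorch as pl

# model, train_dl, val_dl defined as in the repro script.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,
    strategy="ddp_spawn",     # keep the multiprocessing launcher from the traceback
    max_epochs=1,
    limit_train_batches=10,   # run only 10 training batches instead of all 391
    limit_val_batches=0,      # skip the val loop entirely
    enable_checkpointing=False,
    logger=False,
)
trainer.fit(model, train_dl, val_dl)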
I think /dev/shm is just undersized; each /dev/shm/torch_* file is ~77MB. Oddly, with /dev/shm set to a larger size (and with #9828 applied), the test passes under runsc but still hangs under runc:
$ sudo docker run --gpus all --runtime=runsc --shm-size=128g sha256:9ecb6090c8fae027985f0b7bf5be27ec4438a0ac6d6ba966bafea55f0fd3d772
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='19:13:03') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:02<00:00, 72318989.98it/s]Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 112MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [07:46<00:00, 0.84it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [07:47<00:00, 0.84it/s, v_num=0]
Training duration (seconds): 491.89744663238525
Per Jamie's findings, the reproducer should now work with gVisor (after increasing the /dev/shm size limit).
There are TODOs still referencing this issue:
Search TODO
Thanks for all your great help here! Still meaning to look into https://github.com/google/gvisor/blob/HEAD/images/gpu/pytorch/issue_9827.py#L87
We tested this on A100s with 40GB and 80GB, and the merged fix works.
Revisiting the currently skipped test added during this issue's investigation. Could use your guidance @ayushr2 on how to get debug logs enabled for test executions. I've tried the typical way (daemon.json and reloading Docker), but that hasn't done anything. I read through the test code a bit and didn't see anything relevant.
I'm running the test on an A100 80GB SXM4 server and getting SIGSEGV:
ubuntu@207-211-185-116:~/gvisor$ make test TARGETS="//test/gpu:pytorch_test"
--- TAG default
--- DOCKER BUILD
gvisor-builder-7a84ed41-x86_64
sha256:0c0586de8611f38f0fd8041a6c3aaebc65685cc4ac32b8742bec947a12224c89
--- DOCKER RUN
gvisor-bazel-7a84ed41-x86_64
1a4bf151325dd1c159f5338141e21e4de689b5e8000641adcca48d86398d35da
--- TEST //test/gpu:pytorch_test
Another command holds the client lock:
pid=139701
owner=client
cwd=/home/ubuntu/gvisor
Waiting for it to complete...
Another command (pid=139701) is running. Waiting for it to complete on the server (server_pid=139703)...
Loading:
Loading:
Loading: 1 packages loaded
Analyzing: target //test/gpu:pytorch_test (2 packages loaded, 0 targets configured)
Analyzing: target //test/gpu:pytorch_test (63 packages loaded, 767 targets configured)
Analyzing: target //test/gpu:pytorch_test (84 packages loaded, 11204 targets configured)
Analyzing: target //test/gpu:pytorch_test (435 packages loaded, 15706 targets configured)
INFO: Analyzed target //test/gpu:pytorch_test (440 packages loaded, 17542 targets configured).
INFO: Found 1 test target...
[2 / 25] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[1,605 / 1,606] Testing //test/gpu:pytorch_test; 0s linux-sandbox
[1,605 / 1,606] Testing //test/gpu:pytorch_test; 11s linux-sandbox
FAIL: //test/gpu:pytorch_test (see /home/ubuntu/.cache/bazel/_bazel_ubuntu/9cf5d2f37e613aac6d73fb8ae06b7c50/execroot/__main__/bazel-out/k8-fastbuild/testlogs/test/gpu/pytorch_test/test.log)
[1,605 / 1,606] 1 / 1 tests, 1 failed; Testing //test/gpu:pytorch_test; 38s linux-sandbox
INFO: From Testing //test/gpu:pytorch_test:
==================== Test output for //test/gpu:pytorch_test:
--- FAIL: TestIssue9827 (21.07s)
pytorch_test.go:60: Failed: container returned non-zero status: 1, msg: ""
Container output:
==========
== CUDA ==
==========
CUDA Version 12.2.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='02:53:39') parent_process=None
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Missing logger folder: /lightning_logs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Missing logger folder: /lightning_logs
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Traceback (most recent call last):
File "/issue_9827.py", line 88, in <module>
trainer.fit(model, train_dl, val_dl)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lightning/pytorch/strategies/launchers/multiprocessing.py", line 144, in launch
while not process_context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
FAIL
================================================================================
Target //test/gpu:pytorch_test up-to-date:
bazel-bin/test/gpu/pytorch_test_/pytorch_test
INFO: Elapsed time: 46.666s, Critical Path: 39.11s
INFO: 2 processes: 1 internal, 1 linux-sandbox.
INFO: Build completed, 1 test FAILED, 2 total actions
//test/gpu:pytorch_test FAILED in 38.6s
/home/ubuntu/.cache/bazel/_bazel_ubuntu/9cf5d2f37e613aac6d73fb8ae06b7c50/execroot/__main__/bazel-out/k8-fastbuild/testlogs/test/gpu/pytorch_test/test.log
Executed 1 out of 1 test: 1 fails locally.
INFO: Build completed, 1 test FAILED, 2 total actions
INFO: Build Event Protocol files produced successfully.
make: *** [Makefile:64: test] Error 3
how to get debug logs enabled for test executions
I usually look at the Makefile and the Buildkite configuration to see how the make command is being used. In this case, we have a Makefile target named gpu-all-tests: https://github.com/google/gvisor/blob/a5b10b7dd04c309680b9c4c69080a7aae0f87074/Makefile#L304-L310
And see how this is invoked in Buildkite: https://github.com/google/gvisor/blob/a5b10b7dd04c309680b9c4c69080a7aae0f87074/.buildkite/pipeline.yaml#L180-L185
So for your use case, maybe comment out lines 307-309 in the Makefile and invoke make gpu-all-tests RUNTIME_ARGS="--debug". Note that the --debug-log flag is added automatically in https://github.com/google/gvisor/blob/a5b10b7dd04c309680b9c4c69080a7aae0f87074/Makefile#L141. The Makefile will reload the Docker daemon and everything.
To find the debug logs, see what --debug-log flag was used in /etc/docker/daemon.json; you will find all the logs there. Hope that's helpful.
Thanks a lot!
Description
When running multi-GPU training on A100s, applications can attempt to use the unimplemented NV50_P2P allocation class. This presents as a 'mapping of buffer object failed' error. I've taken a look at the implementation of this allocation class and unfortunately it's non-trivial: https://github.dev/NVIDIA/open-gpu-kernel-modules/blob/4c29105335610933e744f4ab2524ea63fc39edaf/src/common/sdk/nvidia/inc/class/cl503b.h#L57
Opening this as a tracking issue.
Steps to reproduce
Dockerfile
This Dockerfile runs, but I wasn't able to use it to reproduce the issue because I couldn't get an on-demand multi-GPU A100 VM in GCP.
On Modal this script reproduces the issue most of the time. I was able to observe the unknown allocation class log line by running the script with debug logs enabled and then jumping onto the worker it ran on to read them.
runsc version
docker version (if using docker)
No response
uname
No response
runsc debug logs (if available)
Omitted. Logs show unknown allocation class for 0x0000503b, which is NV50_P2P.