21.07 looks a little old. Can you reproduce the issue with the latest Triton release (22.03)?
Can you also share the output of dmesg so that we can see why tritonserver was killed?
@tanmayv25 thanks for the reply. I have investigated with both 21.07 and the latest Triton.
I confirm that MIG is disabled correctly via nvidia-smi.
Logs of Triton startup (/workspace is an empty directory):
root@sample-triton-latest-74fccc696d-58tgs:/opt/tritonserver# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/workspace
I0406 05:21:45.088080 4518 libtorch.cc:1309] TRITONBACKEND_Initialize: pytorch
I0406 05:21:45.088196 4518 libtorch.cc:1319] Triton TRITONBACKEND API version: 1.8
I0406 05:21:45.088200 4518 libtorch.cc:1325] 'pytorch' TRITONBACKEND API version: 1.8
2022-04-06 05:21:45.376048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-04-06 05:21:45.411645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0406 05:21:45.411737 4518 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0406 05:21:45.411762 4518 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0406 05:21:45.411768 4518 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0406 05:21:45.411772 4518 tensorflow.cc:2216] backend configuration:
{}
I0406 05:21:45.414874 4518 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0406 05:21:45.414905 4518 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0406 05:21:45.414909 4518 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0406 05:21:45.414912 4518 onnxruntime.cc:2365] backend configuration:
{}
I0406 05:21:45.471530 4518 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0406 05:21:45.471547 4518 openvino.cc:1217] Triton TRITONBACKEND API version: 1.8
I0406 05:21:45.471551 4518 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.8
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sample-triton-latest-74fccc696d-58tgs exited on signal 9 (Killed).
--------------------------------------------------------------------------
dmesg
Looks like your system is going out of memory.
[45019.521686] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[45019.521688] [ 7857] 65535 7857 241 1 28672 0 -998 pause
[45019.521690] [ 9563] 0 9563 974 720 49152 0 999 bash
[45019.521691] [ 60753] 0 60753 1060 887 45056 0 999 bash
[45019.521692] [ 60919] 0 60919 630 147 40960 0 999 sleep
[45019.521694] [ 60992] 0 60992 25458 2426 77824 0 999 mpirun
[45019.521695] [ 60997] 0 60997 9695196 83217 1458176 0 999 tritonserver
[45019.521695] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=a82e417e149bceb2f24a3c73f90fd97b033185a234846eff07c07c4e32ca59a0,mems_allowed=0-3,oom_memcg=/kubepods/burstable/podcb040dc7-7546-49e2-ae77-320dc45ed2ce,task_memcg=/kubepods/burstable/podcb040dc7-7546-49e2-ae77-320dc45ed2ce/a82e417e149bceb2f24a3c73f90fd97b033185a234846eff07c07c4e32ca59a0,task=tritonserver,pid=60997,uid=0
[45019.521714] Memory cgroup out of memory: Killed process 60997 (tritonserver) total-vm:38780784kB, anon-rss:96564kB, file-rss:236304kB, shmem-rss:0kB, UID:0 pgtables:1424kB oom_score_adj:999
[45019.542213] oom_reaper: reaped process 60997 (tritonserver), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Try specifying a lower --pinned-memory-pool-byte-size?
With an empty model repository, no models are actually loaded. I wonder where the extra memory is being spent.
@GuanLuo, do you have any idea?
I don't know why it would cause OOM even when no model is loaded. Would it also be worth trying to remove the backend shared libraries so Triton starts without loading any framework libraries?
@tanmayv25 Is this command correct? The result is the same.
root@sample-triton-only-747d5f564b-vqk6s:/opt/tritonserver# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=20000000000
I0411 03:03:57.509330 46959 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509551 46959 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509568 46959 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509576 46959 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509585 46959 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509593 46959 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509602 46959 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0411 03:03:57.509610 46959 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
I0411 03:03:57.993612 46959 libtorch.cc:998] TRITONBACKEND_Initialize: pytorch
I0411 03:03:57.993646 46959 libtorch.cc:1008] Triton TRITONBACKEND API version: 1.4
I0411 03:03:57.993650 46959 libtorch.cc:1014] 'pytorch' TRITONBACKEND API version: 1.4
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sample-triton-only-747d5f564b-vqk6s exited on signal 9 (Killed).
--------------------------------------------------------------------------
dmesg
[468744.708634] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=92205d550d71282b3f6d25fd1844c3d252b0ceee832ed33832d1e8a4ad0c8a76,mems_allowed=0-3,oom_memcg=/kubepods/burstable/pod45dbec92-0481-4b70-b999-158b7cf2f909,task_memcg=/kubepods/burstable/pod45dbec92-0481-4b70-b999-158b7cf2f909/92205d550d71282b3f6d25fd1844c3d252b0ceee832ed33832d1e8a4ad0c8a76,task=tritonserver,pid=59345,uid=0
[468744.708698] Memory cgroup out of memory: Killed process 59345 (tritonserver) total-vm:39297024kB, anon-rss:85992kB, file-rss:227752kB, shmem-rss:0kB, UID:0 pgtables:1364kB oom_score_adj:999
[468744.724455] oom_reaper: reaped process 59345 (tritonserver), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Would it also be worth trying to remove the backend shared libraries so Triton starts without loading any framework libraries?
I'm sorry, I don't understand this. What should I do to investigate?
Triton will attempt to load some backend (framework) libraries when it starts (e.g. I0404 11:55:52.570841 69 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow). If you remove all the backends shipped in /opt/tritonserver/backends/ (rm -r /opt/tritonserver/backends/*), then Triton will have to start without any backends:
I0408 23:50:11.814732 817 server.cc:576]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
We can narrow down the scope depending on whether Triton starts successfully.
@GuanLuo Thank you for the help!
I have tried deleting all the backend libraries, but the OOM still occurs (Triton version is 21.07).
root@sample-triton-only-747d5f564b-vqk6s:/opt/tritonserver# rm -r /opt/tritonserver/backends/*
root@sample-triton-only-747d5f564b-vqk6s:/opt/tritonserver# ls /opt/tritonserver/backends/
root@sample-triton-only-747d5f564b-vqk6s:/opt/tritonserver# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/
I0412 08:09:13.073400 57484 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073633 57484 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073650 57484 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073658 57484 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073666 57484 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073674 57484 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073687 57484 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0412 08:09:13.073695 57484 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sample-triton-only-747d5f564b-vqk6s exited on signal 9 (Killed).
--------------------------------------------------------------------------
root@sample-triton-only-747d5f564b-vqk6s:/opt/tritonserver# tritonserver --model-repository=/
I0412 08:09:30.787453 57495 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787684 57495 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787702 57495 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787715 57495 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787729 57495 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787742 57495 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787757 57495 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0412 08:09:30.787771 57495 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
Killed
dmesg
[573399.692350] Tasks state (memory values in pages):
[573399.692350] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[573399.692353] [ 8132] 65535 8132 241 1 28672 0 -998 pause
[573399.692354] [ 11938] 0 11938 994 756 45056 0 999 bash
[573399.692356] [ 55564] 0 55564 1060 927 45056 0 999 bash
[573399.692357] [ 56603] 0 56603 627 130 45056 0 999 sleep
[573399.692359] [ 56667] 0 56667 8896554 61104 897024 0 999 tritonserver
[573399.692360] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=92205d550d71282b3f6d25fd1844c3d252b0ceee832ed33832d1e8a4ad0c8a76,mems_allowed=0-3,oom_memcg=/kubepods/burstable/pod45dbec92-0481-4b70-b999-158b7cf2f909,task_memcg=/kubepods/burstable/pod45dbec92-0481-4b70-b999-158b7cf2f909/92205d550d71282b3f6d25fd1844c3d252b0ceee832ed33832d1e8a4ad0c8a76,task=tritonserver,pid=56667,uid=0
[573399.692432] Memory cgroup out of memory: Killed process 56667 (tritonserver) total-vm:35586216kB, anon-rss:77612kB, file-rss:158612kB, shmem-rss:8192kB, UID:0 pgtables:876kB oom_score_adj:999
[573399.704863] oom_reaper: reaped process 56667 (tritonserver), now anon-rss:0kB, file-rss:77084kB, shmem-rss:8192kB
The same result with 22.03:
root@sample-triton-latest-74fccc696d-58tgs:/opt/tritonserver# rm -r /opt/tritonserver/backends/*
root@sample-triton-latest-74fccc696d-58tgs:/opt/tritonserver# ls /opt/tritonserver/backends/
root@sample-triton-latest-74fccc696d-58tgs:/opt/tritonserver# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node sample-triton-latest-74fccc696d-58tgs exited on signal 9 (Killed).
--------------------------------------------------------------------------
root@sample-triton-latest-74fccc696d-58tgs:/opt/tritonserver# tritonserver --model-repository=/
Killed
dmesg
[573621.401513] Tasks state (memory values in pages):
[573621.401514] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[573621.401521] [ 7857] 65535 7857 241 1 28672 0 -998 pause
[573621.401527] [ 9563] 0 9563 974 720 49152 0 999 bash
[573621.401534] [ 60239] 0 60239 1060 887 49152 0 999 bash
[573621.401537] [ 60753] 0 60753 630 130 45056 0 999 sleep
[573621.401539] [ 60859] 0 60859 8735457 49696 569344 0 999 tritonserver
[573621.401548] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=a82e417e149bceb2f24a3c73f90fd97b033185a234846eff07c07c4e32ca59a0,mems_allowed=0-3,oom_memcg=/kubepods/burstable/podcb040dc7-7546-49e2-ae77-320dc45ed2ce,task_memcg=/kubepods/burstable/podcb040dc7-7546-49e2-ae77-320dc45ed2ce/a82e417e149bceb2f24a3c73f90fd97b033185a234846eff07c07c4e32ca59a0,task=tritonserver,pid=60859,uid=0
[573621.401589] Memory cgroup out of memory: Killed process 60859 (tritonserver) total-vm:34941828kB, anon-rss:11588kB, file-rss:104700kB, shmem-rss:82496kB, UID:0 pgtables:556kB oom_score_adj:999
[573621.411454] oom_reaper: reaped process 60859 (tritonserver), now anon-rss:0kB, file-rss:77084kB, shmem-rss:82496kB
[573621.453318] Cannot map memory with base addr 0x7f13be000000 and size of 0x10000 pages
[573621.453339] NVRM: failed to copy out ioctl data
Can you try these experiments and share your findings?
tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=N
Does it work for the following runs?
Run 1: N = 0
Run 2: N = 1000
Run 3: N = 1000000
Also, what is the total memory available in the system?
@tanmayv25
Triton version: 21.07
Machine spec: 900 GiB of memory (https://docs.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series)
top - 04:41:13 up 12 min, 0 users, load average: 0.20, 0.81, 0.69
Tasks: 4 total, 1 running, 3 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 907082.2 total, 855742.9 free, 5092.4 used, 46246.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 897293.8 avail Mem
Delete Libraries:
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# rm -r /opt/tritonserver/backends/*
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# ls /opt/tritonserver/backends/
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# ls /workspace
ls: cannot access '/workspace': No such file or directory
N=0
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=0
I0414 04:38:54.287572 69 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287879 69 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287896 69 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287909 69 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287918 69 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287937 69 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287950 69 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0414 04:38:54.287964 69 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
I0414 04:38:56.477714 69 pinned_memory_manager.cc:244] Pinned memory pool disabled
I0414 04:38:56.495776 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0414 04:38:56.495790 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0414 04:38:56.495794 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0414 04:38:56.495797 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0414 04:38:56.495809 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 4 with size 67108864
I0414 04:38:56.495817 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 5 with size 67108864
I0414 04:38:56.495822 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 6 with size 67108864
I0414 04:38:56.495827 69 cuda_memory_manager.cc:105] CUDA memory pool is created on device 7 with size 67108864
Killed
N=1000
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=1000
I0414 04:40:07.772539 87 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772764 87 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772782 87 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772800 87 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772812 87 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772824 87 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772837 87 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0414 04:40:07.772846 87 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
I0414 04:40:08.074938 87 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f642b200000' with size 1000
I0414 04:40:08.090917 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0414 04:40:08.090930 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0414 04:40:08.090933 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0414 04:40:08.090936 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0414 04:40:08.090940 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 4 with size 67108864
I0414 04:40:08.090954 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 5 with size 67108864
I0414 04:40:08.090957 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 6 with size 67108864
I0414 04:40:08.090963 87 cuda_memory_manager.cc:105] CUDA memory pool is created on device 7 with size 67108864
Killed
N=1000000
root@sample-triton-only-747d5f564b-b8vdl:/opt/tritonserver# tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=1000000
I0414 04:40:46.878367 101 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878591 101 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878608 101 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878622 101 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878635 101 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878651 101 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878665 101 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0414 04:40:46.878678 101 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
I0414 04:40:47.172029 101 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fdde9200000' with size 1000000
I0414 04:40:47.188096 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0414 04:40:47.188110 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0414 04:40:47.188114 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0414 04:40:47.188126 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0414 04:40:47.188129 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 4 with size 67108864
I0414 04:40:47.188136 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 5 with size 67108864
I0414 04:40:47.188140 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 6 with size 67108864
I0414 04:40:47.188143 101 cuda_memory_manager.cc:105] CUDA memory pool is created on device 7 with size 67108864
Killed
I see. One more experiment. Can you try this command?
tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=0 --cuda-memory-pool-byte-size=0:0 --cuda-memory-pool-byte-size=1:0 --cuda-memory-pool-byte-size=2:0 --cuda-memory-pool-byte-size=3:0 --cuda-memory-pool-byte-size=4:0 --cuda-memory-pool-byte-size=5:0 --cuda-memory-pool-byte-size=6:0 --cuda-memory-pool-byte-size=7:0
Do you still see the failure? Read more about --cuda-memory-pool-byte-size here: https://github.com/triton-inference-server/server/blob/main/src/main.cc#L555
64 MB should not be a big deal for 40 GB GPUs and a 900 GB machine. Most likely it is an issue with your environment; these experiments are meant to narrow that down.
@tanmayv25 Thank you for the explanation! The result is as follows:
root@sample-triton-only-747d5f564b-h2wvt:/opt/tritonserver# rm -r /opt/tritonserver/backends/*
root@sample-triton-only-747d5f564b-h2wvt:/opt/tritonserver# ls /opt/tritonserver/backends/
root@sample-triton-only-747d5f564b-h2wvt:/opt/tritonserver# ls /workspace
ls: cannot access '/workspace': No such file or directory
root@sample-triton-only-747d5f564b-h2wvt:/opt/tritonserver# tritonserver --model-repository=/workspace --pinned-memory-pool-byte-size=0 --cuda-memory-pool-byte-size=0:0 --cuda-memory-pool-byte-size=1:0 --cuda-memory-pool-byte-size=2:0 --cuda-memory-pool-byte-size=3:0 --cuda-memory-pool-byte-size=4:0 --cuda-memory-pool-byte-size=5:0 --cuda-memory-pool-byte-size=6:0 --cuda-memory-pool-byte-size=7:0
I0415 04:00:35.583063 106 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583346 106 metrics.cc:290] Collecting metrics for GPU 1: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583363 106 metrics.cc:290] Collecting metrics for GPU 2: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583376 106 metrics.cc:290] Collecting metrics for GPU 3: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583385 106 metrics.cc:290] Collecting metrics for GPU 4: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583404 106 metrics.cc:290] Collecting metrics for GPU 5: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583417 106 metrics.cc:290] Collecting metrics for GPU 6: NVIDIA A100-SXM4-40GB
I0415 04:00:35.583431 106 metrics.cc:290] Collecting metrics for GPU 7: NVIDIA A100-SXM4-40GB
I0415 04:00:36.749200 106 pinned_memory_manager.cc:244] Pinned memory pool disabled
I0415 04:00:36.765320 106 cuda_memory_manager.cc:115] CUDA memory pool disabled
Killed
Marking this as a bug; we will investigate further into why Triton is going OOM.
[44517.301476] memory: usage 131072kB, limit 131072kB, failcnt 15691
Try adding more memory to your K8s Pod via deployment.yaml to avoid OOM.
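For reference, the dmesg entry above (memory: usage 131072kB, limit 131072kB) shows the Pod's memory cgroup capped at 128 MiB, which tritonserver exceeds just while initializing on this 8-GPU node. Below is a minimal, hedged sketch of what the resources section of the deployment.yaml could look like; the container name, image tag, and sizes are illustrative assumptions, not the reporter's actual manifest:

```yaml
# deployment.yaml (fragment) -- illustrative sketch only; names and sizes are assumptions.
# The 131072 kB cgroup limit seen in dmesg (128 MiB) is far below Triton's startup footprint.
spec:
  template:
    spec:
      containers:
        - name: tritonserver                          # assumed container name
          image: nvcr.io/nvidia/tritonserver:22.03-py3
          resources:
            requests:
              memory: "8Gi"      # assumed baseline; raise further for large models
            limits:
              memory: "16Gi"     # must comfortably exceed what tritonserver needs at startup
              nvidia.com/gpu: 8  # GPU count of the ND96asr v4 node
```

With a limit in the multi-GiB range, the memcg OOM killer should no longer fire during startup; the exact values depend on the models you plan to load.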
Thanks for the suggestion, Jian!
Closing due to inactivity. If you would like this issue reopened for follow-up, please let us know.
Description
I want to deploy Triton server via Azure Kubernetes Service. The target node is ND96asr v4, which is equipped with 8 A100 GPUs. Triton server cannot start up successfully, even without loading any models.
Triton Information
nvcr.io/nvidia/tritonserver:21.07-py3
ND96asr v4
To Reproduce
1. Deploy via deployment.yaml.
2. Log in to the pod and run: mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/
3. Confirm the output.
When starting up without mpirun, Killed is observed.
Expected behavior
Startup succeeds. The following output is from a node with 1 GPU.