Open natel9178 opened 2 years ago
Hi @natel9178! Thanks for the thorough description of the problem. If I understand correctly, you are running 32 parallel models here. Additionally, I see these frames in the stack trace:
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
This would suggest a problem with the CPU threads used. Could you try reducing to num_threads=1 in the DALI pipeline definition and see if this helps? Generally, CPU operations in DALI use a thread-per-sample mapping, so if you have batch_size=1, there won't be any difference: still only one thread will be used.
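To illustrate the thread-per-sample point, here is a toy sketch in plain Python. It is not DALI code and does not reflect DALI's actual thread pool; `process_batch` and the doubling "operator" are made-up names, and the only assumption is that each sample in a batch becomes one task for a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def process_batch(batch, num_threads):
    """Toy model of thread-per-sample mapping (hypothetical, not DALI's API).

    Each sample is submitted as one task. Returns the processed samples and
    the number of distinct worker threads that actually ran a task.
    """
    used_threads = set()
    lock = threading.Lock()

    def work(sample):
        # Record which worker thread handled this sample.
        with lock:
            used_threads.add(threading.get_ident())
        return sample * 2  # stand-in for a CPU operator

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(work, batch))
    return results, len(used_threads)


# With a single-sample batch, extra threads have nothing to do.
results, threads_used = process_batch([3], num_threads=8)
print(results, threads_used)
```

In this model a batch of one sample keeps exactly one worker busy regardless of `num_threads`, which is why dropping to num_threads=1 should not change anything for batch_size=1.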
If this does not help, could you provide a core dump or a repro that we can run on our side?
Hi!
I'm trying to use DALI to preprocess images to send to YOLO. I have a two-GPU system that segfaults and dies after running DALI for a few minutes. The DALI step runs in a Triton ensemble model that sends data to a YOLO model compiled to TensorRT. The stack trace seems to indicate the problem is within DALI.
Repro:
The preprocessing code is the following:
and config.pbtxt
Images are sent to Triton at about 100 fps (batch size 1, concurrency 32), and it throws this error after several rounds of processing. The error reproduces after a few minutes:
Also, interestingly enough, running this with
Causes Signal (11) to show up faster (seconds) rather than minutes during ensembling over two GPUs.
Other things I've tried
I've tried multiple changes, such as .gpu() and setting the external source to cpu, which causes the signal 11 to show up later.
Theories
Versions
NVIDIA Release 21.12 (build 30441439)
Any thoughts? Appreciate this in advance.