A fast, easy-to-use, production-ready inference server for computer vision supporting deployment of many popular model architectures and fine-tuned models.
Description
We had a bug that made it impossible to use `InferencePipeline` inside the `inference` server with CUDA support - this is a side effect of a well-known problem with CUDA and Python multiprocessing. The default process start method inside our container is `fork`, but `spawn` is recommended to make CUDA work.
We basically do not use any parent-process resources in the stream manager app (which manages the separate `InferencePipeline` processes inside the `inference` server), so for the manager process and all of its children we set `spawn`, without affecting the server process itself. Thanks to this change, we can successfully spawn multiple downstream `InferencePipeline` processes.
As far as I have tested, everything works as it should, but we need to keep an eye on it over time, as the change may have minor side effects that we do not see now - for instance in the nuances of Workflows behaviour.
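A minimal sketch of the start-method change, assuming a helper like the one below (names are illustrative, not the actual stream manager code): instead of changing the global start method of the server process, the manager obtains a dedicated `spawn` context and uses it only for the worker processes it creates.

```python
import multiprocessing


def _pipeline_worker(pipeline_id):
    # Placeholder for the real InferencePipeline worker logic; the freshly
    # spawned interpreter can initialise CUDA safely here.
    print(f"worker {pipeline_id} started in a freshly spawned interpreter")


def start_pipeline_worker(pipeline_id):
    # get_context("spawn") returns a context whose Process class uses the
    # "spawn" start method, leaving the parent (server) process untouched.
    spawn_context = multiprocessing.get_context("spawn")
    process = spawn_context.Process(target=_pipeline_worker, args=(pipeline_id,))
    process.start()
    return process


if __name__ == "__main__":
    worker = start_pipeline_worker("pipeline-0")
    worker.join()
```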
[!CAUTION]
One negative side effect is that spawning each downstream `InferencePipeline` process inside the `inference` server now takes ~15s. This is a known side effect of `spawn`, but the scale of it in our case is really big 😢. Unfortunately, this seems to be the way to go in our setup.
[!IMPORTANT]
Since the change of the process start method from `fork` to `spawn` introduced so much latency, we introduced a daemon thread that keeps at least `n` idle processes ready to serve as pipeline workers. This way, latency on the user end is reduced to model load time and camera connection - ~2s instead of over 15s.
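A hypothetical sketch of the pre-warming idea (function and constant names here are made up, not the actual stream manager API): a daemon thread keeps a small pool of already spawned, idle worker processes, so claiming a worker for a new pipeline skips the ~15s `spawn` cost.

```python
import multiprocessing
import queue
import threading
import time

MIN_IDLE_WORKERS = 2  # the "at least n" idle processes mentioned above
SPAWN_CTX = multiprocessing.get_context("spawn")


def _idle_worker(task_queue):
    # A freshly spawned process sits idle until the manager hands it work.
    task = task_queue.get()
    if task is not None:
        print(f"running pipeline task: {task}")


def _keep_pool_warm(idle_workers):
    # Daemon thread: top the pool up whenever it drops below the minimum,
    # paying the spawn cost in the background instead of on user request.
    while True:
        if idle_workers.qsize() < MIN_IDLE_WORKERS:
            task_queue = SPAWN_CTX.Queue()
            process = SPAWN_CTX.Process(
                target=_idle_worker, args=(task_queue,), daemon=True
            )
            process.start()
            idle_workers.put((process, task_queue))
        else:
            time.sleep(0.5)


def claim_worker(idle_workers):
    # Called when a user starts a pipeline: grab an already-running process,
    # so user-facing latency is only model load + camera connection.
    return idle_workers.get()


if __name__ == "__main__":
    idle_workers = queue.Queue()
    threading.Thread(target=_keep_pool_warm, args=(idle_workers,), daemon=True).start()
    process, task_queue = claim_worker(idle_workers)  # blocks until a warm worker exists
    task_queue.put("demo-pipeline")
    process.join()
```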
Type of change
Please delete options that are not relevant.
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] This change requires a documentation update
How has this change been tested, please provide a testcase or example of how you tested the change?
e2e tests on T4 GPU
to be tested in a broader scope, including Jetsons
Any specific deployment considerations
For example, documentation changes, usability, usage/costs, secrets, etc.