A fast, easy-to-use, production-ready inference server for computer vision supporting deployment of many popular model architectures and fine-tuned models.
Description
We had a bug that made it impossible to use `InferencePipeline` inside the `inference` server with CUDA support - this is a side effect of a well-known problem with CUDA and Python multiprocessing. The default process start method inside our container is `fork`, but `spawn` is recommended to make CUDA work.
We basically do not use any parent-process resources in the stream manager app (which manages the separate `InferencePipeline` processes inside the `inference` server), so for the manager process and all of its children we set `spawn`, without affecting the server process itself. Thanks to this change, we can successfully spawn multiple downstream `InferencePipeline` processes.
As far as I have tested, everything works as it should, but we need to keep an eye on it over time, as the change may have minor side effects that we do not see now - for instance in the nuances of Workflows behaviour.
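A minimal sketch of the start-method change, assuming a helper like the one below (names are illustrative, not the actual stream manager code): instead of changing the global start method of the server process, the manager obtains a dedicated `spawn` context and uses it only for the worker processes it creates.

```python
import multiprocessing


def _pipeline_worker(pipeline_id):
    # Placeholder for the real InferencePipeline worker logic; the freshly
    # spawned interpreter can initialise CUDA safely here.
    print(f"worker {pipeline_id} started in a freshly spawned interpreter")


def start_pipeline_worker(pipeline_id):
    # get_context("spawn") returns a context whose Process class uses the
    # "spawn" start method, leaving the parent (server) process untouched.
    spawn_context = multiprocessing.get_context("spawn")
    process = spawn_context.Process(target=_pipeline_worker, args=(pipeline_id,))
    process.start()
    return process


if __name__ == "__main__":
    worker = start_pipeline_worker("pipeline-0")
    worker.join()
```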
[!CAUTION]
One negative side effect is that spawning each downstream `InferencePipeline` process inside the `inference` server now takes ~15s. This is a known side effect of `spawn`, but the scale of it in our case is really big 😢. Unfortunately, this seems to be the way to go in our setup.
[!IMPORTANT]
Since the change of the process start method from `fork` to `spawn` introduced so much latency, we introduced a daemon thread that keeps at least `n` idle processes ready to serve as pipeline workers. This way, latency on the user end is reduced to model load time and camera connection - ~2s instead of over 15s.
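A hypothetical sketch of the pre-warming idea (function and constant names here are made up, not the actual stream manager API): a daemon thread keeps a small pool of already spawned, idle worker processes, so claiming a worker for a new pipeline skips the ~15s `spawn` cost.

```python
import multiprocessing
import queue
import threading
import time

MIN_IDLE_WORKERS = 2  # the "at least n" idle processes mentioned above
SPAWN_CTX = multiprocessing.get_context("spawn")


def _idle_worker(task_queue):
    # A freshly spawned process sits idle until the manager hands it work.
    task = task_queue.get()
    if task is not None:
        print(f"running pipeline task: {task}")


def _keep_pool_warm(idle_workers):
    # Daemon thread: top the pool up whenever it drops below the minimum,
    # paying the spawn cost in the background instead of on user request.
    while True:
        if idle_workers.qsize() < MIN_IDLE_WORKERS:
            task_queue = SPAWN_CTX.Queue()
            process = SPAWN_CTX.Process(
                target=_idle_worker, args=(task_queue,), daemon=True
            )
            process.start()
            idle_workers.put((process, task_queue))
        else:
            time.sleep(0.5)


def claim_worker(idle_workers):
    # Called when a user starts a pipeline: grab an already-running process,
    # so user-facing latency is only model load + camera connection.
    return idle_workers.get()


if __name__ == "__main__":
    idle_workers = queue.Queue()
    threading.Thread(target=_keep_pool_warm, args=(idle_workers,), daemon=True).start()
    process, task_queue = claim_worker(idle_workers)  # blocks until a warm worker exists
    task_queue.put("demo-pipeline")
    process.join()
```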
Type of change
Please delete options that are not relevant.
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] This change requires a documentation update
How has this change been tested, please provide a testcase or example of how you tested the change?
e2e tests on T4 GPU
to be tested in a broader scope, including Jetsons
Any specific deployment considerations
For example, documentation changes, usability, usage/costs, secrets, etc.