triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to free multiple gpu memory #7825

Open 1120475708 opened 11 hours ago

1120475708 commented 11 hours ago

The question is how to free GPU memory. Related issue:

https://github.com/triton-inference-server/onnxruntime_backend/issues/103

When the model is deployed on a single GPU, I can configure memory arena shrinkage so that GPU memory is released after each request, but when the model is deployed across multiple GPUs, I don't know what the parameter format should look like:

parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:3" }  }

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 3 ]
    }
]
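
This single-GPU parameter appears to map to ONNX Runtime's memory.enable_memory_arena_shrinkage run-config entry. For reference, here is a minimal standalone sketch of the same behavior outside Triton, assuming the onnxruntime-gpu Python package, a placeholder model.onnx, and a placeholder input shape:

import numpy as np
import onnxruntime as ort

# One session pinned to a single device, mirroring one KIND_GPU instance on gpus: [ 3 ].
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", {"device_id": 3})],
)

run_opts = ort.RunOptions()
# Ask ONNX Runtime to shrink the GPU 3 memory arena once this run completes.
run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:3")

# Placeholder input; the shape is not from the issue.
feed = {sess.get_inputs()[0].name: np.zeros((1, 3, 224, 224), dtype=np.float32)}
sess.run(None, feed, run_options=run_opts)

In this sketch the session holds an arena for exactly one device, which is why a value like gpu:3 lines up with an instance pinned to GPU 3.
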
1120475708 commented 11 hours ago

I get the following errors when I write it like this:

parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0;gpu:1;gpu:2" }  }

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0,1,2 ]
    }
]
 Internal desc = in ensemble 'similarity2_1', onnx runtime error 2: Did not find an arena based allocator registered for device-id  combination in the memory arena shrink list: gpu:0

 Internal desc = in ensemble 'similarity2_1', onnx runtime error 2: Did not find an arena based allocator registered for device-id  combination in the memory arena shrink list: gpu:1
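
One way to narrow this down outside Triton is the hedged diagnostic sketch below (not from the issue; model path and input shape are placeholders): pin a session to a single GPU, as each KIND_GPU instance is, and try shrink lists that name other devices. If ONNX Runtime reports the same "Did not find an arena based allocator" error, then each session only registers an arena for its own device, and a combined gpu:0;gpu:1;gpu:2 list cannot be satisfied by any single instance.

import numpy as np
import onnxruntime as ort

def try_shrink_list(device_id: int, shrink_list: str) -> None:
    # One session per device, mirroring one KIND_GPU instance.
    sess = ort.InferenceSession(
        "model.onnx",
        providers=[("CUDAExecutionProvider", {"device_id": device_id})],
    )
    run_opts = ort.RunOptions()
    run_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", shrink_list)
    feed = {sess.get_inputs()[0].name: np.zeros((1, 3, 224, 224), dtype=np.float32)}
    try:
        sess.run(None, feed, run_options=run_opts)
        print(f"device {device_id}: '{shrink_list}' accepted")
    except Exception as exc:
        print(f"device {device_id}: '{shrink_list}' rejected: {exc}")

for dev in (0, 1, 2):
    try_shrink_list(dev, "gpu:0;gpu:1;gpu:2")  # combined list from the failing config
    try_shrink_list(dev, f"gpu:{dev}")         # only the session's own device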