microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Failed to allocate memory for requested buffer of size X #20038

Open aaditya-srivathsan opened 4 months ago

aaditya-srivathsan commented 4 months ago

I am trying to deploy a custom model on Triton server (23.08) with the onnxruntime_backend (onnxruntime version 1.15.1), but while doing so we are hitting this error:

onnx runtime error 6: Non-zero status code returned while running Mul node. Name:'Mul_8702' Status Message: /workspace/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 2830172160

There are 7 other models hosted on the same server and those work fine (even under stress), but things break once this new model is added. Any idea why this might be happening? The server is hosted on a T4 GPU and these are our current stats:

| Option | Value |
| --- | --- |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |

Separately, while testing without the Triton server setup and using an onnxruntime session directly, we saw that even though the ONNX file is only 350MB and the input shape is [3, 1280, 1280], GPU memory consumption jumps to about 9GB after a single request with batch_size = 1 (this is with FP32; converting to FP16 still shows 5GB usage for a single batch). For reference, the actual batch size used for inference is 8.
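
For reference, the standalone test is roughly the following (a minimal sketch, not our exact script; the model path, input tensor name, and batch-first input layout are assumptions):

```python
# Sketch of the standalone repro: load the ~350MB model on the CUDA EP and
# run a single batch-1 request with a [3, 1280, 1280] input.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

input_name = sess.get_inputs()[0].name                          # assumes a single input
dummy = np.random.rand(1, 3, 1280, 1280).astype(np.float32)     # batch_size = 1, FP32

outputs = sess.run(None, {input_name: dummy})
# After this single run, device memory (as reported by nvidia-smi) sits around 9GB.
```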

[Screenshot: GPU memory usage after a single request, 2024-03-21]

Any help understanding why this might be happening and how to fix it would be appreciated. Thanks!

hariharans29 commented 4 months ago

1) By default, the BFCArena (ORT's memory pool implementation) is used to allocate the weights (initializers), and it can grow quite a bit during the weights allocation process. Usually this is not so bad, as the "unused" memory in the pool will be used to service Run() requests. But you have a case where you are hosting multiple models on the same server, and depending on the memory usage of Run() for each of these models, some portion of the memory in each model's pool might be wasted. To cut down on this, I suggest using the option to bypass the memory arena for weights (usage example: https://github.com/microsoft/onnxruntime/blob/d30c81d270894f41ccce7b102b1d4aedd9e628b1/onnxruntime/test/shared_lib/test_inference.cc#L3065). This ensures that the weights are not allocated through the memory pool (and hence do not grow it during the weights' allocation), so the pool's growth is only a function of the memory usage during Run() itself. Keep in mind that the first Run() might be a tad slower with this option, since that is the Run() in which the memory pool actually grows (as opposed to growing during session initialization). This is one way to ensure minimal memory wastage while hosting multiple models on the same server.

2) The second thing to try is to tweak the arena's extension strategy. The default extension strategy may be sub-optimal for your scenario; try changing it to "kSameAsRequested" to be more economical with respect to memory growth.

(1) or (2) (or both) might help in your usage scenario; a sketch applying both in the standalone Python API is shown below.
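
For the standalone ORT session, a minimal Python sketch applying both suggestions could look like the following (the model path is an assumption, and the Triton backend has its own mechanism for passing these options):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# (1) Allocate initializers with the device allocator instead of the BFC arena,
#     so the arena only grows to cover Run()-time allocations.
so.add_session_config_entry("session.use_device_allocator_for_initializers", "1")

# (2) Grow the arena only by as much as each request needs (kSameAsRequested)
#     instead of the default kNextPowerOfTwo.
cuda_provider = ("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})

sess = ort.InferenceSession("model.onnx", sess_options=so, providers=[cuda_provider])
```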

aaditya-srivathsan commented 4 months ago

Thanks @hariharans29! Let me try one of these two approaches and see if that helps.

aaditya-srivathsan commented 4 months ago

@hariharans29 so despite enabling the two options, I still see the exact same error. In my config.pbtxt file, I am passing the two as parameters:

parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
parameters { key: "use_device_allocator_for_initializers" value: { string_value: "1" } }

Any idea how to further debug this?

hariharans29 commented 4 months ago

Is the config.pbtxt file the way to specify ORT options to Triton server? If so, I am not sure whether support for these options has been enabled in tritonserver. Please check with the relevant folks on this.

I would suggest trying these options in the standalone ORT setup you have and studying the differences against the baseline. It should make some difference in the amount of memory allocated; how subtle or marked that is, I don't know.
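
One rough way to compare against the baseline in the standalone setup (a sketch only; it assumes GPU 0 and the pynvml package, and measures device-wide usage rather than per-process):

```python
import pynvml
import onnxruntime as ort

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mb():
    # Device-wide used memory in MB (includes other processes on the GPU).
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)

before = used_mb()
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
after_load = used_mb()
# ... run a request here, then read used_mb() again ...
print(f"session load: {after_load - before:.0f} MB")
```

Running this once without the two options above and once with them should show how much of the growth comes from the arena's behavior around the weights versus Run() itself.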

pranavsharma commented 4 months ago

This is the list of options supported by Triton's ORT backend: https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#model-config-options.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.