microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] How to free GPU memory for transformers ONNX models #19445

Open Gaggi72 opened 9 months ago

Gaggi72 commented 9 months ago

Describe the issue

I am using a sentence-transformers model with ONNX Runtime to generate embeddings. I have a FastAPI app that initialises the ONNX Runtime InferenceSession at app startup. Whenever new tokens are submitted for embedding creation, GPU memory is occupied and is not released after successful execution. I have changed gpu_mem_limit, but usage still exceeds it after k iterations. io_binding is also used for inference and is cleared on every API call. How can I free the GPU memory?

To reproduce

import traceback

import onnxruntime
from transformers import AutoTokenizer

def update_ep_options(provider, model):
    default_ep_options = model.get_provider_options()[provider]
    ep_options = {
        "gpu_mem_limit": "2147483648",  # 2 GiB
        "arena_extend_strategy": "kSameAsRequested",
    }
    for k, v in ep_options.items():
        default_ep_options[k] = v
    return default_ep_options

session_options = onnxruntime.SessionOptions()
session_options.intra_op_num_threads = 8

# model_path and self.model_name come from the surrounding class.
model = onnxruntime.InferenceSession(path_or_bytes=model_path,
                                     sess_options=session_options,
                                     providers=['CUDAExecutionProvider'])

tokenizer = AutoTokenizer.from_pretrained(self.model_name, do_lower_case=True)

ep_options = update_ep_options(provider='CUDAExecutionProvider', model=model)
model.set_providers(['CUDAExecutionProvider'], [ep_options])
run_options = onnxruntime.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
def encode(self, data: list):
    # tokenize, bind_inputs_to_device, bind_outputs_to_device and
    # mean_pooling are helpers defined elsewhere in the class.
    try:
        inputs = tokenize(data)
        self.bind_inputs_to_device(inputs=inputs)
        start_, end = self.bind_outputs_to_device(input_ids_shape=inputs['input_ids'].shape)
        # Run inference against the pre-bound device buffers.
        self.model.run_with_iobinding(self.io_binding, self.run_options)
        embeddings = self.mean_pooling([start_], inputs['attention_mask']).cpu().detach().numpy()
        self.io_binding.clear_binding_inputs()
        self.io_binding.clear_binding_outputs()
        return embeddings

    except Exception as e:
        logger.error(f"Error while encoding -> {traceback.format_exc()} --- {e}")

Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

11.8

Model File

No response

Is this a quantized model?

No

pranavsharma commented 9 months ago

This should help. Try setting this run option. https://github.com/microsoft/onnxruntime/blob/e9ab56fa64c0644a2dc5287d1cd3b945bf7d7981/include/onnxruntime/core/session/onnxruntime_run_options_config_keys.h#L19-L27
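
A minimal sketch of setting that run option in Python (the model path and the input feed below are placeholders; use your model's real input names and shapes):

    import numpy as np
    import onnxruntime

    sess = onnxruntime.InferenceSession("model.onnx",
                                        providers=["CUDAExecutionProvider"])

    run_options = onnxruntime.RunOptions()
    # Request that the CUDA device-0 arena be shrunk at the end of this
    # Run(); unused chunks allocated beyond the initial one are freed.
    run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

    # Placeholder feed for illustration only.
    outputs = sess.run(None,
                       {"input_ids": np.zeros((1, 128), dtype=np.int64)},
                       run_options=run_options)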

newgrit1004 commented 8 months ago

@pranavsharma, Thank you very much. It works for me.

The example python code is here. https://github.com/microsoft/onnxruntime/blob/4bfa69def85476b33ccfaf68cf070f3fb65d39f7/onnxruntime/test/python/onnxruntime_test_python.py#L1586

By the way, inference with this option enabled is slower than I expected.

Do you have any other options to avoid decreasing inference speed?

pranavsharma commented 8 months ago

It shouldn't have impacted the inference speed. Can you tell where the extra time is being spent? Probably in the memory cleanup after inferencing is complete?

hariharans29 commented 8 months ago

Some decrease in speed is expected, as the memory cleanup involves invoking multiple (in most cases) cudaFree() calls, and that cost is cooked into the Run() call. To best use this feature, it is important to not allocate weights through the memory pool (arena) and to set a high enough "initial" chunk size for the arena, such that "most" Run() calls can be serviced with that initial chunk and do not invoke shrinkage. If there are outlier Run() calls that allocate more memory than the initial chunk (a model that processes large sequences, for example), that extra memory will be freed at the end of the Run(). The "initial" chunk is never freed until the end of the session.

Please see the detailed comment here: https://github.com/microsoft/onnxruntime/issues/9509#issuecomment-951546580
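
A sketch of that setup, assuming the session-config key "session.use_device_allocator_for_initializers" from onnxruntime_session_options_config_keys.h (it keeps the weights out of the arena, so shrinkage only ever releases per-Run() scratch memory; "model.onnx" is a placeholder):

    import onnxruntime

    session_options = onnxruntime.SessionOptions()
    # Allocate initializers (weights) with the device allocator directly
    # instead of through the arena, so they don't pin arena chunks open.
    session_options.add_session_config_entry(
        "session.use_device_allocator_for_initializers", "1")

    sess = onnxruntime.InferenceSession("model.onnx",
                                        sess_options=session_options,
                                        providers=["CUDAExecutionProvider"])

    run_options = onnxruntime.RunOptions()
    # Outlier Run() calls that grow the arena past its initial chunk will
    # have that extra memory freed when the Run() completes.
    run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")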

newgrit1004 commented 8 months ago

@pranavsharma Yes, it is. @hariharans29's answer is valuable to me. I understand the context now. I appreciate it.