
[Performance] A way to share weights between sessions #15301

Closed · leugenea closed this issue 1 year ago

leugenea commented 1 year ago

Describe the issue

If I have multiple inference sessions for the same model, I have to store the constant weights multiple times, even though they're constant and cannot be changed at runtime.

From what I see in the documentation (the "Share initializer(s) and their ORT pre-processed version(s) between sessions" section), there's a way to manually add an initializer using the AddInitializer() function. It also looks like I can use the CreateSessionWithPrepackedWeightsContainer() variant to prepack these manually added initializers.

Am I correct that, to share all weights between multiple inference sessions, I should do the following (rough sketch below):

  1. manually export all weights and their names from the ONNX model,
  2. save and load them into tensors myself,
  3. add them to the first created session with AddInitializer(),
  4. create a prepacked weights container and pass it to CreateSessionWithPrepackedWeightsContainer(),
  5. and use this container for future sessions?
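
If that's the intended flow, I imagine something roughly like this (untested sketch based on my reading of the headers; the weight name, shape, model paths and the LoadWeightFromDisk() helper are made up for illustration):

```cpp
#include <onnxruntime_cxx_api.h>

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read raw float32 data previously exported from the model.
std::vector<float> LoadWeightFromDisk(const std::string& path) {
  std::ifstream f(path, std::ios::binary | std::ios::ate);
  std::vector<float> v(static_cast<size_t>(f.tellg()) / sizeof(float));
  f.seekg(0);
  f.read(reinterpret_cast<char*>(v.data()), v.size() * sizeof(float));
  return v;
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-weights");

  // 1-2. Weights exported from the model and loaded by us; the buffers must outlive all sessions.
  std::vector<float> fc1_weight = LoadWeightFromDisk("fc1.weight.bin");
  std::vector<int64_t> fc1_shape = {1024, 1024};

  auto mem_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
  Ort::Value fc1_tensor = Ort::Value::CreateTensor<float>(
      mem_info, fc1_weight.data(), fc1_weight.size(), fc1_shape.data(), fc1_shape.size());

  // 3. Register our copy of the initializer; the name must match the initializer name in the model.
  Ort::SessionOptions so;
  so.AddInitializer("fc1.weight", fc1_tensor);

  // 4. Container that lets sessions share the pre-packed versions of these weights (C API).
  OrtPrepackedWeightsContainer* prepacked = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreatePrepackedWeightsContainer(&prepacked));

  // 5. Every session created with the same options + container reuses the same bytes.
  //    (The model path is an ORTCHAR_T string, i.e. a wide string on Windows.)
  Ort::Session s1(env, "model.onnx", so, prepacked);
  Ort::Session s2(env, "model.onnx", so, prepacked);

  // ... Run() on s1/s2 ...
  Ort::GetApi().ReleasePrepackedWeightsContainer(prepacked);
  return 0;
}
```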

So my questions are:

  1. Is there an easier way to share weights?
  2. It looks like this method only supports the CPU provider, but constant weights could also be shared between multiple sessions running on the same GPU device. Is there any way to do that?

To reproduce

  1. Load the same model into one or N different sessions using the CPU or CUDA provider (the same provider for all sessions),
  2. See that the memory (RAM or GPU RAM) consumption for N sessions is roughly N× the consumption for one session.

Platform

Any

OS Version

Any

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

v1.14.1

ONNX Runtime API

Any

Architecture

X64

Execution Provider

Default CPU, CUDA

Is this a quantized model?

Should not matter

elephantpanda commented 1 year ago

Sorry, this might be a silly question, but why do you have multiple sessions for the same model? Why not have just one session and run things in batches?

leugenea commented 1 year ago

Sorry, this might be a silly question, but why do you have multiple sessions for the same model? Why not have just one session and run things in batches?

I must keep the latency of model replies low, so I cannot use batches. I want to have a pool of workers, and I assumed each worker must have its own Session instance.

pranavsharma commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

brian-at-pieces commented 1 year ago

I actually have a similar issue. I'm fine-tuning a base T5 model for multiple tasks using LoRA, so the majority of the weights are the same between my sessions. I'm successfully using AddInitializer to replace the existing weights at runtime with the task-specific LoRA weights, but I'd like to run multiple tasks without blowing up memory, so I was thinking of using this same method to create multiple session instances that share the overlapping weights (i.e. all weights besides the LoRA weights).

brian-at-pieces commented 1 year ago

Thinking about this more, I'm not sure it'll work, because the LoRA initializers have the same names across tasks, so they'll conflict when using AddInitializer.

pranavsharma commented 1 year ago

Not sure why it wouldn't work. Can you elaborate? A 1P team has the exact same use case (same weights across hundreds of models) and is using this API successfully. The AddInitializer API is only concerned with the raw bytes assuming the name used is the same as that in the model.

leugenea commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

But this way every Run must have its own buffers as well as its own thread pool. Isn't that a huge waste of resources too? I mean thread creation, memory allocations/deallocations, memory fragmentation. In my opinion, having shared weights across multiple sessions but keeping buffers/thread pools as per-session "state" seems a lot more optimal.

brian-at-pieces commented 1 year ago

Not sure why it wouldn't work. Can you elaborate? A 1P team has the exact same use case (same weights across hundreds of models) and is using this API successfully. The AddInitializer API is only concerned with the raw bytes assuming the name used is the same as that in the model.

My current working pipeline for a single session instance is: fine-tune T5 with LoRA for multiple tasks, convert each fine-tuned task model to ONNX, extract and save the LoRA initializers for each task (they have the same names across all tasks), pass just one of these fine-tuned ONNX models to ORT (because everything is the same between tasks besides the LoRA initializers), and then, in ORT, load a set of task-specific LoRA initializers and use AddInitializer to override the existing ones in the T5 model.

This works fine for a single model, but let's say I wanted to run two sessions in parallel. I don't have the memory budget to spin up two full instances of the T5 model, but because two fine-tuned T5 models share 99% of the same initializers, AddInitializer seemed like a great solution. However, as I mentioned above, the LoRA initializers have the same names regardless of task, so this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones.

Sorry if this is a bit confusing. Lmk if I can clarify anything

pranavsharma commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

But this way every Run must have its own buffers as well as its own thread pool. Isn't that a huge waste of resources too? I mean thread creation, memory allocations/deallocations, memory fragmentation. In my opinion, having shared weights across multiple sessions but keeping buffers/thread pools as per-session "state" seems a lot more optimal.

There's only 1 threadpool and one arena allocator per session. Memory planning for tensors, etc, is done only once when creating and initializing the session. There's only one copy of the weights in the session obj. It's in your best interest to share the same session obj between multiple threads.
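
Rough illustration of the intended pattern (untested sketch; the model path and input/output names are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-session");
  Ort::SessionOptions so;
  Ort::Session session(env, "model.onnx", so);  // one session -> one copy of the weights

  auto worker = [&session]() {
    // Each thread owns its own inputs/outputs; the Session itself can be called concurrently.
    std::vector<float> input(3 * 224 * 224, 0.f);
    std::array<int64_t, 4> shape{1, 3, 224, 224};
    auto mem_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        mem_info, input.data(), input.size(), shape.data(), shape.size());

    const char* input_names[] = {"input"};    // placeholder I/O names
    const char* output_names[] = {"output"};
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input_tensor, 1, output_names, 1);
  };

  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
  for (auto& t : pool) t.join();
  return 0;
}
```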

pranavsharma commented 1 year ago

@brian-pieces Sorry, not following this statement "this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones.". What method are you using to load them?

I assume you've externalized the shared weights for each ONNX file. By this I mean you've converted the ONNX files such that they refer to external files for these shared weights. Now it's a matter of just loading these external files once (outside ORT) and supplying the memory pointer to ORT for each session using the AddInitializer API. The name of the weight used should be the same as that used in the model files. The names can be the same across all model files since the session is scoped per ONNX model. If this is not clear, maybe a repro might help.
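
Something along these lines (untested sketch; the file name, weight name, shape and model paths are placeholders). The key points are that the file is read only once and that the buffer and the Ort::Value stay alive for as long as any session uses them:

```cpp
#include <onnxruntime_cxx_api.h>

#include <fstream>
#include <iterator>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "external-weights");

  // Read the external weight file once, outside ORT.
  std::ifstream f("shared_weights.bin", std::ios::binary);
  std::vector<char> bytes((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());

  // Wrap the buffer without copying it.
  std::vector<int64_t> shape = {512, 512};
  auto mem_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
  Ort::Value shared = Ort::Value::CreateTensor<float>(
      mem_info, reinterpret_cast<float*>(bytes.data()), bytes.size() / sizeof(float),
      shape.data(), shape.size());

  // Supply the same pointer to every session; the name must match the initializer in each model.
  Ort::SessionOptions so_a, so_b;
  so_a.AddInitializer("shared.weight", shared);
  so_b.AddInitializer("shared.weight", shared);
  Ort::Session session_a(env, "model_a.onnx", so_a);
  Ort::Session session_b(env, "model_b.onnx", so_b);
  return 0;
}
```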

leugenea commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

But this way every Run must have its own buffers as well as its own thread pool. Isn't that a huge waste of resources too? I mean thread creation, memory allocations/deallocations, memory fragmentation. In my opinion, having shared weights across multiple sessions but keeping buffers/thread pools as per-session "state" seems a lot more optimal.

There's only 1 threadpool and one arena allocator per session. Memory planning for tensors, etc, is done only once when creating and initializing the session. There's only one copy of the weights in the session obj. It's in your best interest to share the same session obj between multiple threads.

Okay, so if I set 1 inter-op and 1 intra-op thread in the session options, will a thread pool with 1 thread be created and will all Run() calls be executed sequentially? If not, how many threads are there in the created thread pool?

About buffers: I get that memory planning for the weights is done only once per session object, but what about runtime (intermediate) buffers? Are they created and destroyed for each Run() invocation?

brian-at-pieces commented 1 year ago

@brian-pieces Sorry, not following this statement "this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones.". What method are you using to load them?

I assume you've externalized the shared weights for each ONNX file. By this I mean you've converted the ONNX files such that they refer to external files for these shared weights. Now it's a matter of just loading these external files once (outside ORT) and supplying the memory pointer to ORT for each session using the AddInitializer API. The name of the weight used should be the same as that used in the model files. The names can be the same across all model files since the session is scoped per ONNX model. If this is not clear, maybe a repro might help.

@pranavsharma Yes, that's right about externalizing the weights. The issue is that I'd like to have multiple sessions that share a single session options pointer to utilize the AddInitializer API, so I can reuse those 99% overlapping weights (everything but LoRA). However, because the LoRA weights differ for each model but have the same names, I don't think using a single session options object will work.

elephantpanda commented 1 year ago

I don't quite know how LoRAs work, but could you do one session for the main network and then separate sessions for the LoRAs? It's my understanding that a LoRA is another network joined onto the first? I'm not sure how it's inserted. It adds more layers, I think, depending on where the layers are inserted.

On a separate note: if you have your network torch_model.bin and your lora.bin, how do you convert the combined model into an ONNX file?

brian-at-pieces commented 1 year ago

From the original paper, LoRA "freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture". So it's not another network; rather, it essentially injects low-rank representations of a handful of the QKV matrices into the base model. With our configuration it ends up injecting 30 [512x512] matrices into a T5 model.
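
If I remember the paper right, each adapted weight matrix is effectively used as W + (α/r)·B·A, where B and A are the small trainable rank-r matrices, so between tasks only the B and A matrices differ.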

On your second question: we're using a conditional-generation T5 model from Hugging Face, and LoRA doesn't change the model architecture in a breaking way, so we're able to use ORT's convert_generation tool.

pranavsharma commented 1 year ago

@brian-pieces You don't have to use the same session options object. Session options are lightweight. Just use a separate session options object for each session.
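
For instance, something like this hypothetical helper (untested sketch; the model path and the maps holding the tensors are made up): build one options object per task, register the shared base tensors plus that task's LoRA tensors, and create the session.

```cpp
#include <onnxruntime_cxx_api.h>

#include <map>
#include <string>

// Hypothetical helper: `base` holds the tensors shared by all tasks, `lora` holds this task's
// LoRA tensors (same names across tasks, different bytes). All Ort::Values and the buffers
// behind them must outlive the returned session.
Ort::Session MakeTaskSession(Ort::Env& env,
                             const std::map<std::string, Ort::Value>& base,
                             const std::map<std::string, Ort::Value>& lora) {
  Ort::SessionOptions opts;  // a separate, lightweight options object per session
  for (auto& [name, tensor] : base) opts.AddInitializer(name.c_str(), tensor);
  for (auto& [name, tensor] : lora) opts.AddInitializer(name.c_str(), tensor);
  return Ort::Session(env, "t5_with_lora.onnx", opts);  // same model file for every task
}
```

Call it once per task with the same base map and a different lora map; the base tensors are loaded only once.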

pranavsharma commented 1 year ago

@leugenea Can you move this to the discussion board, as it's a discussion, not an issue?

Okay, so if I set 1 inter-op and 1 intra-op thread in the session options, will a thread pool with 1 thread be created and will all Run() calls be executed sequentially? If not, how many threads are there in the created thread pool?

Yes. The default # of threads in the threadpool is equal to the # of physical cores on the machine.
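
For reference, those knobs look roughly like this (illustrative fragment, not a full program):

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeSingleThreadedOptions() {
  Ort::SessionOptions so;
  so.SetIntraOpNumThreads(1);           // intra-op pool: parallelism inside a single operator
  so.SetInterOpNumThreads(1);           // inter-op pool: only used with ORT_PARALLEL execution mode
  so.SetExecutionMode(ORT_SEQUENTIAL);  // the default mode: nodes run one after another
  return so;
}
```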

About buffers: I get that memory planning for the weights is done only once per session object, but what about runtime (intermediate) buffers? Are they created and destroyed for each Run() invocation?

Yes

leugenea commented 1 year ago

@pranavsharma thank you for answers.

I think it would be nice to mention somewhere in the documentation that Session objects are stateless and so can (and should) be used from multiple threads.

Also, it would be nice to have a usage example for AddInitializer() and the prepacked weights container, because they're only mentioned once (in tests) and it's not clear what gets stored in the prepacked weights, what gets loaded from the model file/buffer, and what gets shared between sessions, and in which cases.

pranavsharma commented 1 year ago

@brian-pieces You don't have to use the same session options object. Session options are lightweight. Just use a separate session options object for each session.

@brian-pieces does this work for you? If so, can this issue be closed?

brian-at-pieces commented 1 year ago

@pranavsharma sorry for the delay, I've been experimenting a bit - yes that makes more sense. Thanks.

One final question: if I know that I'm going to be overriding initializers, is it possible to use ClearField() on the initializers I'm going to override and then save the modified model, so that I can serve a smaller model? To be more specific, I'm finding the attribute (raw_data, float_data, etc.) where each tensor stores its data and calling ClearField() on that attribute.

I tried this and I get this error at session creation time: Error TensorProto (tensor name: onnx::MatMul_1520_scale) should contain one and only one value field.

pranavsharma commented 1 year ago

@pranavsharma sorry for the delay, I've been experimenting a bit - yes that makes more sense. Thanks.

One final question: if I know that I'm going to be overriding initializers, is it possible to use ClearField() on the initializers I'm going to override and then save the modified model, so that I can serve a smaller model? To be more specific, I'm finding the attribute (raw_data, float_data, etc.) where each tensor stores its data and calling ClearField() on that attribute.

I tried this and I get this error at session creation time: Error TensorProto (tensor name: onnx::MatMul_1520_scale) should contain one and only one value field.

You can simply point to some dummy non-existing file.

brian-at-pieces commented 1 year ago

You can simply point to some dummy non-existing file.

Not sure what you mean. As opposed to using ClearField?

pranavsharma commented 1 year ago

You need to reference the initializers in the model whether their data is external or embedded in the model, so you can't just get rid of them. The dummy-file comment was about the file you need to reference in the TensorProto when you mark its data as external.
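
Roughly like this, using the generated ONNX protobuf C++ classes (untested sketch; the Externalize() helper and the dummy file name are just for illustration):

```cpp
#include <onnx/onnx_pb.h>  // generated ONNX protobuf classes

#include <set>
#include <string>

// For every initializer you plan to override with AddInitializer, drop the embedded payload
// but keep the TensorProto itself, marking its data as living in a (dummy) external file so
// the model still references the initializer.
void Externalize(onnx::ModelProto& model, const std::set<std::string>& names_to_strip) {
  for (auto& init : *model.mutable_graph()->mutable_initializer()) {
    if (names_to_strip.count(init.name()) == 0) continue;
    init.clear_raw_data();                                // drop the embedded bytes...
    init.clear_float_data();                              // ...whichever field actually held them
    init.set_data_location(onnx::TensorProto::EXTERNAL);  // the data now lives outside the model
    init.clear_external_data();
    auto* loc = init.add_external_data();
    loc->set_key("location");
    loc->set_value("does_not_exist.bin");                 // dummy file; never read, since we override
  }
}
```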

brian-at-pieces commented 1 year ago

Ahh okay I hadn't noticed the data_location attribute on TensorProto before. Thanks for the help!

XavierMorin commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

Hi pranavsharma, could you provide a short C++ example doing just that?

pranavsharma commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

Hi pranavsharma, could you provide a short C++ example doing just that?

https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069

XavierMorin commented 1 year ago

You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same in different run threads. It's designed for concurrent runs.

Hi pranavsharma, could you provide a short C++ example doing just that?

https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069

Thank you so much for the timely answer. Can you confirm that the runs happen in parallel, and not sequentially, when using this method?