Closed leugenea closed 1 year ago
Sorry, this might be a silly question, but why do you have multiple sessions for the same model? Why not have just one session and run things in batches?
> Sorry, this might be a silly question, but why do you have multiple sessions for the same model? Why not have just one session and run things in batches?
I must maintain low latency for model replies, so I cannot use batching.
I want to have a pool of workers, so I assumed each worker must have its own Session instance.
You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same one in different run threads. It's designed for concurrent runs.
I actually have a similar issue. I'm finetuning a base T5 model for multiple tasks using LoRA, so the majority of the weights are going to be the same between my sessions. I'm successfully using AddInitializer to replace the existing weights at runtime with the task-specific LoRA weights, but I'd like to run multiple tasks without blowing up memory, so I was thinking of using this same method to create multiple session instances that share the overlapping weights (i.e. all weights besides the LoRA weights).
Thinking about this more, I'm not sure it'll work because the LoRA initializers have the same names between tasks, so they'll conflict when using AddInitializer.
Not sure why it wouldn't work. Can you elaborate? A 1P team has the exact same use case (same weights across hundreds of models) and is using this API successfully. The AddInitializer API is only concerned with the raw bytes, assuming the name used is the same as that in the model.
> You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same one in different run threads. It's designed for concurrent runs.
But this way every Run must have its own buffers as well as its own thread pool. Isn't that a huge waste of resources too? I mean thread creation, memory allocations/deallocations, memory fragmentation.
In my opinion, having shared weights across multiple sessions but buffers/thread pools as session "state" seems a lot more optimal.
> Not sure why it wouldn't work. Can you elaborate? A 1P team has the exact same use case (same weights across hundreds of models) and is using this API successfully. The AddInitializer API is only concerned with the raw bytes, assuming the name used is the same as that in the model.
My current working pipeline for a single session instance is: finetune T5 with LoRA for multiple tasks, convert each finetuned task model to ONNX, extract and save the LoRA initializers for each task (they have the same names across all tasks), pass just one of these finetuned ONNX models to ORT (because everything is the same between tasks besides the LoRA initializers), and then in ORT: load a set of task-specific LoRA initializers and use AddInitializer to override the existing ones in the T5 model.
This works fine for a single model, but let's say I wanted to run two sessions in parallel. I don't have the memory allowance to spin up two full instances of the T5 model, but because two finetuned T5 models share 99% of the same initializers, AddInitializer seemed like a great solution. However, as I mentioned above, the LoRA initializers have the same names regardless of task, so this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones.
Sorry if this is a bit confusing. Let me know if I can clarify anything.
> But this way every Run must have its own buffers as well as its own thread pool. Isn't that a huge waste of resources too? I mean thread creation, memory allocations/deallocations, memory fragmentation. In my opinion, having shared weights across multiple sessions but buffers/thread pools as session "state" seems a lot more optimal.
There's only one threadpool and one arena allocator per session. Memory planning for tensors, etc., is done only once, when creating and initializing the session. There's only one copy of the weights in the session object. It's in your best interest to share the same session object between multiple threads.
@brian-pieces Sorry, not following this statement: "this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones." What method are you using to load them?
I assume you've externalized the shared weights for each ONNX file. By this I mean you've converted the ONNX file so that it refers to external files for these shared weights. Now it's just a matter of loading these external files once (outside ORT) and supplying the memory pointer to ORT for each session using the AddInitializer API. The name of the weight used should be the same as that used in the model files. The names can be the same across all model files, since each session is scoped to one ONNX model. If this is not clear, maybe a repro would help.
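To make the flow above concrete, here is a rough, untested sketch of it against the ORT C++ API: the weight name `encoder.weight`, the shape, the buffer contents, and the model file names are all placeholders for illustration. The key constraints are that the buffer is loaded once outside ORT, is wrapped without copying, and must outlive every session that references it.

```cpp
// Sketch only, assuming the ORT C++ API; names, shapes, and paths are placeholders.
#include <onnxruntime_cxx_api.h>
#include <array>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared-weights");

  // 1. Load the externalized shared weight bytes ONCE, outside ORT.
  //    (Filled from disk in reality; must outlive all sessions below.)
  std::vector<float> shared_weight(512 * 512);

  // 2. Wrap the existing memory in an OrtValue; no copy is made.
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);
  std::array<int64_t, 2> shape{512, 512};
  Ort::Value weight = Ort::Value::CreateTensor<float>(
      mem_info, shared_weight.data(), shared_weight.size(),
      shape.data(), shape.size());

  // 3. Register it under the SAME name the model files use. A separate
  //    (lightweight) SessionOptions per session is fine.
  Ort::SessionOptions opts_a, opts_b;
  opts_a.AddInitializer("encoder.weight", weight);  // placeholder name
  opts_b.AddInitializer("encoder.weight", weight);

  Ort::Session session_a(env, "task_a.onnx", opts_a);
  Ort::Session session_b(env, "task_b.onnx", opts_b);
}
```

Both sessions now read the shared weight from the single buffer, while each model file can still carry its own task-specific (e.g. LoRA) initializers inline.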
> There's only one threadpool and one arena allocator per session. Memory planning for tensors, etc., is done only once, when creating and initializing the session. There's only one copy of the weights in the session object. It's in your best interest to share the same session object between multiple threads.
Okay, so when I set 1 inter-op and 1 intra-op thread in session options, will a threadpool with 1 thread be created and all Run() calls executed sequentially? If not, how many threads are there in the created threadpool?
About buffers: I get that memory planning for the weights is done only once per session object, but what about runtime (intermediate) buffers? Are they created and destroyed for each Run() invocation?
> @brian-pieces Sorry, not following this statement: "this breaks my method of loading them from file for each task and using AddInitializer to override the existing ones." What method are you using to load them?
> I assume you've externalized the shared weights for each ONNX file. By this I mean you've converted the ONNX file so that it refers to external files for these shared weights. Now it's just a matter of loading these external files once (outside ORT) and supplying the memory pointer to ORT for each session using the AddInitializer API. The name of the weight used should be the same as that used in the model files. The names can be the same across all model files, since each session is scoped to one ONNX model. If this is not clear, maybe a repro would help.
@pranavsharma Yes, that's right about externalizing the weights. The issue is that I'd like to have multiple sessions that share a single session options pointer to utilize the AddInitializer API, so I can reuse those 99% overlapping weights (everything but LoRA). However, because the LoRA weights differ for each model but have the same names, I don't think using a single session options object will work.
I don't quite know how LoRAs work, but could you do one session for the main network and then separate sessions for the LoRAs? It's my understanding that a LoRA is another network joined onto the first? I'm not sure how it's inserted; I think it adds more layers, depending on where the layers are inserted.
On a separate note: if you have your network torch_model.bin and your lora.bin, how do you convert the combined model into an ONNX file?
From the original paper, LoRA "freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture". So it's not another network; rather, it essentially injects low-rank representations of a handful of the QKV matrices into the base model. With our configuration it ends up injecting 30 [512x512] matrices into a T5 model.
On your second question: we're using a conditional-generation T5 model from Hugging Face, and LoRA doesn't change the model architecture in a breaking way, so we're able to use ORT's convert_generation tool.
@brian-pieces You don't have to use the same session options object. Session options is a lightweight object. Just use a separate session options object for each session.
@leugenea Can you move this to the discussion board as it's a discussion, not an issue.
> Okay, so when I set 1 inter-op and 1 intra-op thread in session options, will a threadpool with 1 thread be created and all Run() calls executed sequentially? If not, how many threads are there in the created threadpool?
Yes. The default number of threads in the threadpool is equal to the number of physical cores on the machine.
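For reference, the two knobs discussed here are set on the session options before session creation. A minimal, untested sketch (the model path is a placeholder):

```cpp
// Sketch only, assuming the ORT C++ API; "model.onnx" is a placeholder path.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "threads");
  Ort::SessionOptions opts;
  // 1 intra-op thread: no parallelism within a single operator.
  opts.SetIntraOpNumThreads(1);
  // 1 inter-op thread: no parallel execution of independent graph nodes.
  opts.SetInterOpNumThreads(1);
  // Note: concurrent Run() calls from DIFFERENT application threads are
  // still allowed; these settings only control ORT's internal threadpools.
  Ort::Session session(env, "model.onnx", opts);
}
```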
> About buffers: I get that memory planning for the weights is done only once per session object, but what about runtime (intermediate) buffers? Are they created and destroyed for each Run() invocation?
Yes
@pranavsharma Thank you for the answers.
I think it would be nice to mention somewhere in the documentation that Session objects are stateless and so can (and should) be used from multiple threads.
It would also be nice to have a usage example for AddInitializer() and the prepacked weights container, because they're only mentioned once (in tests) and it's not clear what will be stored in the prepacked weights, what will be loaded from the model file/buffer, and what will be shared between sessions.
> @brian-pieces You don't have to use the same session options object. Session options is a lightweight object. Just use a separate session options object for each session.
@brian-pieces does this work for you? If so, can this issue be closed?
@pranavsharma Sorry for the delay; I've been experimenting a bit. Yes, that makes more sense. Thanks.
One final question: if I know that I'm going to be overriding initializers, is it possible to use ClearField() on the initializers I'm going to override and then save the modified model, so that I can serve a smaller model? To be more specific, I'm finding the attribute (raw_data, float_data, etc.) where each tensor stores its data and using ClearField() on that attribute.
I tried this and I get this error at session creation time: Error TensorProto (tensor name: onnx::MatMul_1520_scale) should contain one and only one value field.
> One final question: if I know that I'm going to be overriding initializers, is it possible to use ClearField() on the initializers I'm going to override and then save the modified model, so that I can serve a smaller model? To be more specific, I'm finding the attribute (raw_data, float_data, etc.) where each tensor stores its data and using ClearField() on that attribute.
> I tried this and I get this error at session creation time: Error TensorProto (tensor name: onnx::MatMul_1520_scale) should contain one and only one value field.
You can simply point to some dummy non-existing file.
> You can simply point to some dummy non-existing file.
Not sure what you mean. As opposed to using ClearField?
You need to reference the initializers in the model whether they're external or part of the model, so you can't just get rid of them. The dummy-file comment was about the specific file you need to mention in the TensorProto.
Ahh okay, I hadn't noticed the data_location attribute on TensorProto before. Thanks for the help!
> You don't have to create multiple sessions for the same model; it's a huge waste of resources. Create only one session and reuse the same one in different run threads. It's designed for concurrent runs.
Hi pranavsharma, could you provide a short C++ example doing just that?
> Hi pranavsharma, could you provide a short C++ example doing just that?
https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069
> https://gist.github.com/pranavsharma/c3275863291b20b538cf0cb3265ef069
Thank you so much for this timely answer. Can you confirm that the runs happen in parallel and not sequentially using this method?
Describe the issue
If I have multiple inference sessions for the same model, I have to store the constant weights multiple times, despite the fact that they're constant and cannot be changed at runtime.
From what I see in the documentation (the "Share initializer(s) and their ORT pre-processed version(s) between sessions" section), there's a way to manually add an initializer using the AddInitializer() function. What's more, it looks like I can use the CreateSessionWithPrepackedWeightsContainer() function variant to prepack these manually added initializers.
Am I correct that to share all weights between multiple inference sessions I should:
1. manually add the shared weights using AddInitializer(),
2. create each session with CreateSessionWithPrepackedWeightsContainer()?
So my questions are:
To reproduce
Platform
Any
OS Version
Any
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
v1.14.1
ONNX Runtime API
Any
Architecture
X64
Execution Provider
Default CPU, CUDA
Is this a quantized model?
Should not matter