microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Saving or Exporting `InferenceEngine`s to support model scaling in production #1694

Open joehoover opened 2 years ago

joehoover commented 2 years ago

Is your feature request related to a problem? Please describe. I want to use DeepSpeed Inference in production, and I am wondering whether there are suggested solutions for reducing the scaling latency introduced by init_inference(). Initializing an inference engine takes a considerable amount of time, which makes it difficult to dynamically scale model instances.

Frankly, I assume there already is a solution, but I have not found a description in the documentation.

Describe the solution you'd like I want to reduce or eliminate the latency introduced by the deepspeed.init_inference() call that is used in the DS Inference tutorials. For example, is it possible to export/save an initialized inference engine?

Describe alternatives you've considered I have not considered any alternatives, but I would be open to suggestions.

Additional context I am new to DeepSpeed, and I do not know how the time requirements of inference engine initialization vary across model types and sizes. I was motivated to open this issue after testing the GPT-J inference kernels; I didn't time init_inference(), but it certainly took long enough to pose an obstacle to efficient scaling.

EDIT: initialization takes about 57 seconds on my system (AWS SageMaker ml.g4dn.12xlarge instance).
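For reference, here is a minimal sketch of the flow I was testing, adapted from the GPT-J DS Inference tutorial. The model name, dtype, and kernel-injection flag are illustrative, and the timer is only there to show where the startup latency is spent:

```python
import time

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # illustrative; any HF causal LM the kernels support

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# This is the call whose latency I'm asking about: it swaps modules for
# DeepSpeed's fused inference kernels (and shards the model when mp_size > 1).
start = time.time()
engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
print(f"init_inference() took {time.time() - start:.1f}s")

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
output = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```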

RezaYazdaniAminabadi commented 2 years ago

Hi @joehoover,

Thanks for bringing up this challenge. I will definitely look into this and share more information.

Best, Reza

naxty commented 2 years ago

Hi @RezaYazdaniAminabadi ,

I'm interested in this request as well. Do you have any updates or information to share yet?

Best, Nico

joaopcm1996 commented 1 year ago

Any updates or workarounds on this? DeepSpeed provides great benefits for inference, but if loading a model takes over a minute, it defeats the purpose in production.
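One workaround sketch (not a DeepSpeed feature, just a per-process cache so the init_inference() cost is paid once per worker at startup rather than per request; the helper names are hypothetical, and this does not remove the cold-start cost when a new replica is added, which is the real concern here):

```python
# Hypothetical serving-side workaround: cache the engine per worker process
# so init_inference() runs once at startup instead of once per request.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

_ENGINE = None
_TOKENIZER = None

def get_engine(model_name: str = "EleutherAI/gpt-j-6B"):
    """Build the InferenceEngine on first use and reuse it afterwards."""
    global _ENGINE, _TOKENIZER
    if _ENGINE is None:
        _TOKENIZER = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        _ENGINE = deepspeed.init_inference(
            model, mp_size=1, dtype=torch.float16, replace_with_kernel_inject=True
        )
    return _ENGINE, _TOKENIZER

def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Handle one request against the cached engine."""
    engine, tokenizer = get_engine()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = engine.module.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0])
```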

awan-10 commented 1 year ago

Adding @lekurile to this conversation.

manmay-nakhashi commented 1 year ago

Is there any solution?