hitomi-team / sukima

A ready-to-deploy container implementing an easy-to-use REST API for accessing Language Models.
GNU General Public License v2.0

Implement faster and efficient model loading #30

Closed harubaru closed 2 years ago

harubaru commented 2 years ago

Following this blog post, a zero-copy loading strategy was implemented, which allows for faster and more efficient model loading:

https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8b
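A minimal sketch of the zero-copy idea, using only NumPy so it runs standalone (the dict of weights stands in for a PyTorch `state_dict`; names and shapes are illustrative, not sukima's actual code). Instead of reading each tensor into freshly allocated buffers, the weights are memory-mapped from disk; `torch.from_numpy()` on such an array would likewise avoid a copy:

```python
import numpy as np
import os
import tempfile

# Hypothetical "checkpoint": a dict of weight arrays standing in for a
# PyTorch state_dict (names and shapes are illustrative).
state = {"embed.weight": np.random.rand(1000, 64).astype(np.float32)}

# Serialize each weight to its own .npy file (the "tensorized" form).
tmpdir = tempfile.mkdtemp()
for name, arr in state.items():
    np.save(os.path.join(tmpdir, name + ".npy"), arr)

# Zero-copy load: mmap the files instead of reading them into new buffers,
# so the OS pages weights in lazily and can share them between processes.
loaded = {}
for name in state:
    loaded[name] = np.load(os.path.join(tmpdir, name + ".npy"), mmap_mode="r")

assert isinstance(loaded["embed.weight"], np.memmap)   # backed by the file, not a copy
assert np.array_equal(loaded["embed.weight"], state["embed.weight"])
```

The blog post's actual implementation uses Ray's shared-memory object store rather than file mmaps, but the principle is the same: construct tensors as views over already-resident memory instead of copying bytes at load time.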

Here are some results from running the tensorization code on my crappy laptop:

[results screenshot attachment]

harubaru commented 2 years ago

Quantization works with tensorized models; however, softprompting does not, because FrozenBNBEmbedding is incompatible with torch.nn.Embedding during the resize-embedding operation used to apply the softprompts.

https://github.com/hitomi-team/sukima/blob/ce1ff71e43bc93440446c5afa74b89fc04df84fa/app/gpt/gpthf.py#L335

More work would have to be done on that but that is out of scope for this PR.
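A sketch of why the resize step breaks, using pure-Python stand-ins (no torch or bitsandbytes; all class and attribute names here are hypothetical, not sukima's or bitsandbytes' actual code). A resize helper written against the plain embedding interface assumes a `.weight` matrix it can copy row by row, which a frozen quantized embedding does not expose:

```python
class Embedding:
    """Stand-in for torch.nn.Embedding: exposes a plain weight matrix."""
    def __init__(self, num_embeddings, dim):
        self.num_embeddings = num_embeddings
        self.weight = [[0.0] * dim for _ in range(num_embeddings)]

class QuantizedEmbedding:
    """Stand-in for a frozen 8-bit embedding: weights live in packed
    quantized storage, so there is no plain .weight matrix to copy."""
    def __init__(self, num_embeddings, dim):
        self.num_embeddings = num_embeddings
        self.packed = bytes(num_embeddings * dim)  # opaque quantized blocks

def resize_embedding(old, new_size):
    # Mirrors what a resize helper assumes: read old.weight row by row
    # into a newly allocated, larger embedding.
    new = Embedding(new_size, len(old.weight[0]))
    for i in range(min(old.num_embeddings, new_size)):
        new.weight[i] = old.weight[i]
    return new

resize_embedding(Embedding(10, 4), 12)       # works on the plain class
try:
    resize_embedding(QuantizedEmbedding(10, 4), 12)
except AttributeError as e:
    print("resize failed:", e)               # no .weight on the quantized class
```

Supporting the resize would mean dequantizing into a regular embedding, resizing, and re-quantizing (or teaching the frozen class to resize its packed storage), which is the out-of-scope work mentioned above.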