harrisonvanderbyl / rwkv-cpp-accelerated

A torchless, c++ rwkv implementation using 8bit quantization, written in cuda/hip/vulkan for maximum compatibility and minimum dependencies
MIT License

Faster model loading #13

Open nenkoru opened 1 year ago

nenkoru commented 1 year ago

Currently, the 7b and 14b models take 10s and 15s respectively to load, pretty much the same as vanilla rwkv. It would be great to make these models load as fast as possible, which could lead to great inference capabilities.

I guess a good first milestone would be half of those: 5s and ~7s respectively.

nenkoru commented 1 year ago

I should have mentioned that I meant loading through the PyTorch bindings. With #19 merged, the 7b model loads in 5-6s on my machine. I haven't tried 14b yet. There is also an ongoing PR, #14, which is supposed to give a big boost to model loading time. One caveat is that on AMD, for some reason, mmapping doesn't go well.
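For anyone unfamiliar with why mmapping helps: instead of copying the whole file into memory up front, the OS maps it and pages data in lazily as it is touched. Below is a minimal POSIX sketch of the general idea. It is not the code from #14, and `MappedFile`, `map_file`, and `unmap_file` are hypothetical names made up for illustration.

```cpp
// Sketch only: MappedFile, map_file and unmap_file are hypothetical helpers,
// not part of rwkv-cpp-accelerated. POSIX only (Linux/macOS).
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFile {
    const unsigned char* data = nullptr;  // start of the read-only mapping
    std::size_t size = 0;                 // file size in bytes
};

// Map a file read-only. Returns {nullptr, 0} on failure or for an empty file.
inline MappedFile map_file(const char* path) {
    MappedFile out;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        std::perror("open");
        return out;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return out;
    }
    const std::size_t len = static_cast<std::size_t>(st.st_size);
    // Pages are faulted in lazily on first access; nothing is copied here.
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) {
        std::perror("mmap");
        return out;
    }
    out.data = static_cast<const unsigned char*>(p);
    out.size = len;
    return out;
}

// Release the mapping and reset the handle.
inline void unmap_file(MappedFile& f) {
    if (f.data != nullptr) {
        munmap(const_cast<unsigned char*>(f.data), f.size);
    }
    f = MappedFile{};
}
```

With something like this, weights would be read straight out of `data` (or copied from it to the GPU) rather than through an intermediate buffer. Whether that actually pays off depends on the driver and platform, which may be related to the AMD caveat above.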