@PierpaoloSorbellini The inference section is tagged as WIP. Do we have any basic inference code available in chatllama to load the actor_rl model and run a few queries?
Description
Currently, to run inference with a trained model, the user has to write a small Python script by hand that loads the checkpoint or saved model produced after training, following the way the library serializes it; a minimal example is sketched below.
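For concreteness, such a hand-written script might look like the following sketch. It assumes the actor was fine-tuned from a Hugging Face causal LM and that training saved a plain PyTorch state_dict; the model name, checkpoint path, and checkpoint layout are illustrative assumptions, not chatllama's confirmed API.

```python
# Minimal inference sketch -- the model name, checkpoint path, and checkpoint
# layout below are assumptions for illustration, not chatllama's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"               # hypothetical base model of the actor
CHECKPOINT_PATH = "./checkpoints/actor_rl.pt"  # hypothetical path written by training

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Load the fine-tuned weights; assumes torch.save() stored either a bare
# state_dict or a dict wrapping one under a "model" key.
checkpoint = torch.load(CHECKPOINT_PATH, map_location="cpu")
model.load_state_dict(checkpoint.get("model", checkpoint))
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Run a single query against the loaded actor.
prompt = "Human: What is RLHF?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```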
Moreover, many optimizations could be integrated to speed up inference, such as:
TODO
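The concrete list above is still TODO, but as a hedged illustration of the kind of speed-ups typically meant here, half-precision weights and autograd-free generation are common first steps (the model name is the same assumption as in the sketch above):

```python
# Illustrative speed-ups only; the issue's actual optimization list is TODO.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"  # same hypothetical base model as above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# fp16 weights halve memory traffic, a common first inference optimization.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.to("cuda").eval()

inputs = tokenizer("Human: Hello!\n\nAssistant:", return_tensors="pt").to("cuda")
# inference_mode() skips autograd bookkeeping entirely, a bit faster than no_grad().
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```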