Thanks to not being restrained by HF nonsense, we can probably start implementing trendy stuff HF doesn't have. For a backend to have full integration in Kobold, it needs to support callbacks for viewing and manipulating token logits as they're generated.
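Concretely, the hook we need looks something like this. This is just a minimal sketch with made-up names (not Kobold's actual interface): the backend calls us with the raw logits every step and we get to rewrite them before sampling.

```python
# Minimal sketch of the hook a backend needs to expose. Every name here is
# hypothetical; this is not Kobold's real interface, just the shape of it.
import numpy as np

def generate_with_callback(forward_fn, tokens, logit_callback,
                           max_new_tokens=32, temperature=0.8):
    """forward_fn(tokens) -> raw logits over the vocab for the next token."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        logits = np.asarray(forward_fn(tokens), dtype=np.float32)
        # The part Kobold cares about: view and rewrite the logits
        # (token bans, bias lists, custom samplers) before sampling.
        logits = logit_callback(tokens, logits)
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        tokens.append(int(np.random.choice(len(probs), p=probs)))
    return tokens

def ban_token_zero(tokens, logits):
    # Example callback: hard-ban one token id before sampling.
    logits[0] = -np.inf
    return logits
```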
Some stuff to look into:
[ ] ggml/llama.cpp -- Really fast CPU inference. Written in C++, with a Python binding available here. Not sure if it supports everything we need, but it's worth a looksie (see the sketch after this list).
[ ] rwkv -- RNN-based language model, with a Python module interface available here. Kind of implemented in Kobold already, but the implementation is rough (see the second sketch after this list).
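For llama.cpp specifically, the Python binding (llama-cpp-python) appears to expose a `logits_processor` hook, which would be exactly what we need. Rough sketch below; treat the API details as assumptions until someone verifies them against the actual package:

```python
# Sketch only: assumes llama-cpp-python exposes Llama(...) plus a
# logits_processor hook with an (input_ids, scores) -> scores signature.
# The model path and the banned token id are placeholders.
import numpy as np
from llama_cpp import Llama, LogitsProcessorList

def ban_token(input_ids, scores):
    # The kind of manipulation Kobold needs: rewrite next-token logits
    # before the sampler sees them.
    scores = np.copy(scores)
    scores[13] = -np.inf  # placeholder token id
    return scores

llm = Llama(model_path="ggml-model-q4_0.bin")  # placeholder path
out = llm("Once upon a time",
          max_tokens=32,
          logits_processor=LogitsProcessorList([ban_token]))
print(out["choices"][0]["text"])
```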
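For rwkv, the pip package seems to hand back raw logits (plus the recurrent state) from every `forward()` call, which is the hook we want. Again, a sketch under the assumption that the model/pipeline interface works like this; the model path, strategy string, and tokenizer file are placeholders:

```python
# Sketch only: assumes the `rwkv` package's RWKV/PIPELINE interface works
# roughly like this. The point: forward() returns raw logits we can rewrite
# before sampling, one token at a time, carrying the RNN state along.
import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model="/path/to/rwkv-model.pth", strategy="cpu fp32")
pipeline = PIPELINE(model, "20B_tokenizer.json")

state = None
tokens = pipeline.encode("Once upon a time")
generated = []
for _ in range(32):
    logits, state = model.forward(tokens, state)  # raw logits over the vocab
    logits[0] = -float("inf")                     # manipulate before sampling
    next_token = int(torch.argmax(logits))        # greedy pick, sketch only
    generated.append(next_token)
    tokens = [next_token]                         # feed one token, carry state
print(pipeline.decode(generated))
```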