LiliumSancta opened this issue 1 year ago
Hey, thanks for your interest! Our backend (TVM Unity) supports AMD CPUs out of the box, so it wouldn't be too challenging (likely tens of lines) to introduce support for them. Not too sure about the latency aspect, though.
Yeah, the limitation current LLM inference programs like Oobabooga WebUI and KoboldAI have is that CPU offloading is very slow. Maybe the MLC team could build a very fast CPU offloader that allocates RAM on the fly as soon as VRAM overflows, preventing out-of-memory errors at high context sizes and with big models while still staying relatively fast.
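For what it's worth, here is a minimal sketch of the control flow such a spill-to-RAM allocator would need, assuming a runtime that exposes separate GPU and host allocation calls. The names `alloc_on_gpu`, `alloc_on_cpu`, and `OutOfVRAM` are hypothetical placeholders for illustration, not MLC or TVM APIs.

```python
# Hypothetical sketch: try to place each buffer in VRAM first, and fall back to
# host RAM when the GPU allocator reports it is full. None of these names come
# from MLC or TVM; they only illustrate the "spill when VRAM overflows" idea.

class OutOfVRAM(Exception):
    """Raised by the (hypothetical) GPU allocator when VRAM is exhausted."""


def alloc_on_gpu(nbytes: int) -> bytearray:
    # Placeholder: a real runtime would call its device allocator here and
    # raise only when the card has no free memory left.
    raise OutOfVRAM(f"no room for {nbytes} bytes in VRAM")


def alloc_on_cpu(nbytes: int) -> bytearray:
    # Placeholder: plain host RAM buffer.
    return bytearray(nbytes)


def alloc_with_spill(nbytes: int) -> tuple[bytearray, str]:
    """Allocate on the GPU if possible, otherwise spill to host RAM."""
    try:
        return alloc_on_gpu(nbytes), "gpu"
    except OutOfVRAM:
        return alloc_on_cpu(nbytes), "cpu"


if __name__ == "__main__":
    buf, placement = alloc_with_spill(1 << 20)  # 1 MiB test allocation
    print(f"placed {len(buf)} bytes on {placement}")
```

The hard part isn't the fallback itself but keeping the spilled weights usable without tanking throughput, since every forward pass then has to stream them back over PCIe.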
The dream (for my hardware) is being able to split the model between separate Vulkan devices... maybe splitting up layers like llama.cpp does (rough sketch of that idea below)? This would allow for hybrid IGP+GPU inference, or multi-GPU splitting.
Splitting the model between multiple backends is probably outside the domain of TVM though, right?
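To make the idea concrete, a llama.cpp-style split mostly comes down to assigning contiguous blocks of transformer layers to each device, roughly in proportion to how much memory it has free. The sketch below is purely illustrative; the `Device` type, the memory figures, and the proportional heuristic are my own assumptions, not anything MLC or TVM currently exposes.

```python
# Rough sketch of llama.cpp-style layer splitting: give each device a contiguous
# range of transformer layers sized by its free memory. Hypothetical names only.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    free_bytes: int


def split_layers(n_layers: int, devices: list[Device]) -> dict[str, range]:
    """Assign contiguous layer ranges to devices, proportional to free memory."""
    total = sum(d.free_bytes for d in devices)
    plan, start = {}, 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            count = n_layers - start  # last device takes the remainder
        else:
            count = round(n_layers * d.free_bytes / total)
        plan[d.name] = range(start, start + count)
        start += count
    return plan


# e.g. a 32-layer model split between a 4 GB iGPU and a 12 GB discrete GPU
print(split_layers(32, [Device("igpu", 4 << 30), Device("dgpu", 12 << 30)]))
# -> {'igpu': range(0, 8), 'dgpu': range(8, 32)}
```

The bookkeeping is the easy bit; the real work is moving activations between devices at each layer boundary, which is presumably why this needs runtime support rather than living purely in TVM's compiler.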
As someone who currently does CPU inference, I would love this feature.
Any updates?
Incredible project, I managed to run the model at good speed on my hardware (AMD), thanks. I have a question: do you have any plans to offload the weights so that bigger models like 13B or 30B can run with less VRAM?