LiliumSancta opened this issue 1 year ago
Hey, thanks for your interest! Our backend (TVM Unity) supports AMD CPUs out of the box, so it wouldn't be too challenging (likely tens of lines) to introduce support for them. Not too sure about the latency aspect, though.
Yeah, the limitation current LLM inference programs like Oobabooga WebUI and KoboldAI have is that CPU offloading is very slow. Maybe the MLC team could build a very fast CPU offloader that allocates RAM on the fly as soon as VRAM overflows, preventing out-of-memory errors at high context sizes and with big models while still staying relatively fast.
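For what it's worth, here is a minimal sketch of the control flow such a spill-to-RAM allocator would need, assuming a runtime that exposes separate GPU and host allocation calls. The names `alloc_on_gpu`, `alloc_on_cpu`, and `OutOfVRAM` are hypothetical placeholders for illustration, not MLC or TVM APIs.

```python
# Hypothetical sketch: try to place each buffer in VRAM first, and fall back to
# host RAM when the GPU allocator reports it is full. None of these names come
# from MLC or TVM; they only illustrate the "spill when VRAM overflows" idea.

class OutOfVRAM(Exception):
    """Raised by the (hypothetical) GPU allocator when VRAM is exhausted."""


def alloc_on_gpu(nbytes: int) -> bytearray:
    # Placeholder: a real runtime would call its device allocator here and
    # raise only when the card has no free memory left.
    raise OutOfVRAM(f"no room for {nbytes} bytes in VRAM")


def alloc_on_cpu(nbytes: int) -> bytearray:
    # Placeholder: plain host RAM buffer.
    return bytearray(nbytes)


def alloc_with_spill(nbytes: int) -> tuple[bytearray, str]:
    """Allocate on the GPU if possible, otherwise spill to host RAM."""
    try:
        return alloc_on_gpu(nbytes), "gpu"
    except OutOfVRAM:
        return alloc_on_cpu(nbytes), "cpu"


if __name__ == "__main__":
    buf, placement = alloc_with_spill(1 << 20)  # 1 MiB test allocation
    print(f"placed {len(buf)} bytes on {placement}")
```

The hard part isn't the fallback itself but keeping the spilled weights usable without tanking throughput, since every forward pass then has to stream them back over PCIe.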
The dream (for my hardware) is being able to split the model between separate Vulkan devices... maybe splitting up layers like llama.cpp does (rough sketch of that idea below)? This would allow for hybrid IGP+GPU inference, or multi-GPU splitting.
Splitting the model between multiple backends is probably outside the domain of TVM though, right?
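To make the idea concrete, a llama.cpp-style split mostly comes down to assigning contiguous blocks of transformer layers to each device, roughly in proportion to how much memory it has free. The sketch below is purely illustrative; the `Device` type, the memory figures, and the proportional heuristic are my own assumptions, not anything MLC or TVM currently exposes.

```python
# Rough sketch of llama.cpp-style layer splitting: give each device a contiguous
# range of transformer layers sized by its free memory. Hypothetical names only.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    free_bytes: int


def split_layers(n_layers: int, devices: list[Device]) -> dict[str, range]:
    """Assign contiguous layer ranges to devices, proportional to free memory."""
    total = sum(d.free_bytes for d in devices)
    plan, start = {}, 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            count = n_layers - start  # last device takes the remainder
        else:
            count = round(n_layers * d.free_bytes / total)
        plan[d.name] = range(start, start + count)
        start += count
    return plan


# e.g. a 32-layer model split between a 4 GB iGPU and a 12 GB discrete GPU
print(split_layers(32, [Device("igpu", 4 << 30), Device("dgpu", 12 << 30)]))
# -> {'igpu': range(0, 8), 'dgpu': range(8, 32)}
```

The bookkeeping is the easy bit; the real work is moving activations between devices at each layer boundary, which is presumably why this needs runtime support rather than living purely in TVM's compiler.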
As someone who currently does CPU inference, I would love this feature.
Any updates?
Incredible project, I managed to run the model at good speed on my hardware (AMD), thanks. I have a question: do you have any plans to offload the weights so that bigger models like 13B or 30B can run with less VRAM?