Open · flatsiedatsie opened this issue 3 months ago
I think you can simply achieve this by creating a second instance of `MLCEngine`, calling `engine.reload()` on the new engine instance, and switching engines once it has finished loading.
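A minimal sketch of that pattern, assuming placeholder model IDs and top-level await in a module; the only thing it adds to the suggestion above is that the swap happens after the second `reload()` resolves:

```ts
import { MLCEngine } from "@mlc-ai/web-llm";

// Engine currently serving the user.
let activeEngine = new MLCEngine();
await activeEngine.reload("model_1"); // placeholder model ID

// Prepare the next model in the background; reload() downloads the
// weights and loads them onto the WebGPU device.
const nextEngine = new MLCEngine();
await nextEngine.reload("model_2"); // placeholder model ID

// Switch only once the new model is fully loaded.
activeEngine = nextEngine;
```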
Interesting idea, thanks.
The thing is, I don't always need to actually start the model. For example, a user might want to pre-download some models from a list before a long airplane trip, much like pre-loading the map of Spain into OSMAND (or your map app of choice) before going on holiday.
But maybe I can just forego switching to the new engine instance? Then the files will still be downloaded anyway, right?
For comparison, this is how Wllama does it: it's just a helper function that loads the chunks into the cache and then simply stops there.
@CharlieFRuan Following up on this: if I do something like the snippet below, creating and loading an additional engine instance that never actually runs a completion, would that achieve the goal of downloading additional models without causing GPU memory issues?
```js
const engine1 = new MLCEngine();
const engine2 = new MLCEngine();
// reload() downloads each model and loads it onto the WebGPU device.
await engine1.reload('model_1');
await engine2.reload('model_2');
// Only engine1 is ever used for completions.
await engine1.chat.completions.create({ messages });
```
Thanks for the thoughts and discussions @Neet-Nestor @flatsiedatsie! The code above will work fine: `engine2` will not do completion and `engine1` is not affected by `engine2`. However, `engine2` will load `model_2` onto the WebGPU device, hence creating more burden for the hardware than just "downloading a model". So the code above may fail if `model_1` and `model_2` together exceed the VRAM that the device has.
Therefore, one way to "only download a model, without touching WebGPU" is:

On the tvmjs side:
- Add a flag `onlyDownload` to `fetchNDArrayCache()` (see the sketch after this list):
  https://github.com/apache/tvm/blob/1fcb62023f0a5f878abd5b43ec9e547933fb5fab/web/src/runtime.ts#L1450-L1456
- The flag then makes it return right before loading onto WebGPU (notice the `device` is never used before this line):
  https://github.com/apache/tvm/blob/1fcb62023f0a5f878abd5b43ec9e547933fb5fab/web/src/runtime.ts#L1566
- Or create another API parallel to `fetchNDArrayCache()`, and the two APIs can reuse the same code for downloading the model.

On the webllm side:
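Purely as an illustration of the proposed control flow (none of the names below are the actual tvmjs API; the two phases are stubbed out as callbacks), the flag would amount to an early return between the download phase and the GPU-load phase:

```ts
// Hypothetical sketch of the proposed control flow, not the real
// fetchNDArrayCache() signature or body.
async function fetchNDArrayCacheSketch(
  downloadAllShards: () => Promise<void>, // network + Cache Storage only
  loadShardsToGPU: () => Promise<void>,   // the part that touches the device
  onlyDownload = false,                   // the proposed flag
): Promise<void> {
  await downloadAllShards();
  // Return before the WebGPU device is ever used.
  if (onlyDownload) return;
  await loadShardsToGPU();
}
```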
I ended up coding a custom function that manually loads the files into the cache. I didn't expect separating the downloading from the inference to have such a big effect, but it has helped simplify my code. It's now possible for users to load and use models they have already downloaded while they wait for a new one to finish downloading.
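For illustration, here is a rough sketch of such a prefetch helper, in the spirit of the Wllama helper mentioned earlier. This is not WebLLM's API: the manifest name `ndarray-cache.json`, its `records[].dataPath` field, and the cache name are assumptions about the MLC model layout, so verify them against what WebLLM actually stores in your browser's Cache Storage:

```ts
// Hypothetical prefetch helper; not part of WebLLM's API.
// Downloads a model's weight shards into Cache Storage and stops there,
// without ever touching WebGPU.
async function prefetchModelFiles(
  modelBaseUrl: string,                 // base URL of the model repo, ending in "/"
  cacheName = "webllm/model",           // assumed cache name; check in DevTools
  onProgress?: (done: number, total: number) => void,
): Promise<void> {
  const cache = await caches.open(cacheName);

  // The manifest lists every parameter shard (assumed layout).
  const manifestUrl = new URL("ndarray-cache.json", modelBaseUrl).href;
  const manifest = await (await fetch(manifestUrl)).json();
  const shardUrls: string[] = manifest.records.map(
    (r: { dataPath: string }) => new URL(r.dataPath, modelBaseUrl).href,
  );

  let done = 0;
  for (const url of [manifestUrl, ...shardUrls]) {
    // cache.add() fetches the URL and stores the response.
    if (!(await cache.match(url))) {
      await cache.add(url);
    }
    onProgress?.(++done, shardUrls.length + 1);
  }
}
```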
Perhaps related to this PR, but opposite:
I'd like to be able to easily ask WebLLM to download a second (or third, etc) model to cache, while continuing to use the existing, already loaded model. Then get a callback when the second model has loaded, so that I can inform the user they can now switch to the other model if they prefer.
Or is there an optimal way to do this already?
Currently my idea is to create a separate function that loads the new shards into the cache manually, outside of WebLLM. But I'd prefer to use WebLLM for this if such a feature already exists (I searched the repo but couldn't find one).
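One possible stop-gap within the current API, sketched under the assumption that `engine.unload()` releases the WebGPU resources while the downloaded weights remain in the browser cache; note that, per the discussion above, this still loads the model onto the GPU briefly, so the VRAM caveat applies:

```ts
import { MLCEngine } from "@mlc-ai/web-llm";

// Download a model's artifacts by loading it on a throwaway engine and
// unloading it right away, then notify the caller that it is cached.
async function downloadModelInBackground(
  modelId: string,
  onReady: (modelId: string) => void,
): Promise<void> {
  const tempEngine = new MLCEngine();
  await tempEngine.reload(modelId); // downloads + loads onto WebGPU
  await tempEngine.unload();        // frees the WebGPU memory again
  onReady(modelId);                 // e.g. tell the user they can switch now
}
```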