Open · flatsiedatsie opened this issue 3 months ago
I think you can simply achieve this by creating a second instance of `MLCEngine`, calling `engine.reload()` on the new engine instance, and switching engines once it has finished loading.
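A minimal sketch of that pattern, assuming placeholder model IDs and top-level await in a module; the only thing it adds to the suggestion above is that the swap happens after the second `reload()` resolves:

```ts
import { MLCEngine } from "@mlc-ai/web-llm";

// Engine currently serving the user.
let activeEngine = new MLCEngine();
await activeEngine.reload("model_1"); // placeholder model ID

// Prepare the next model in the background; reload() downloads the
// weights and loads them onto the WebGPU device.
const nextEngine = new MLCEngine();
await nextEngine.reload("model_2"); // placeholder model ID

// Switch only once the new model is fully loaded.
activeEngine = nextEngine;
```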
Interesting idea, thanks.
The thing is, I don't always need to actually start the model. For example, a user might want to pre-download some models from a list before a long airplane trip, much like pre-loading the map of Spain into OSMAND (or your map app of choice) before going on holiday.
But maybe I can just forego switching to the new engine instance? Then the files will still be downloaded anyway, right?
For comparison, this is how Wllama does it: it's just a helper function that loads the chunks into the cache and then simply stops there.
@CharlieFRuan Following up on this: if I do something like the snippet below, creating and loading an additional engine instance that never actually runs a completion, would that achieve the goal of downloading additional models without causing GPU memory issues?
```js
const engine1 = new MLCEngine();
const engine2 = new MLCEngine();
// reload() downloads each model and loads it onto the WebGPU device.
await engine1.reload('model_1');
await engine2.reload('model_2');
// Only engine1 is ever used for completions.
await engine1.chat.completions.create({ messages });
```
Thanks for the thoughts and discussions @Neet-Nestor @flatsiedatsie! The code above will work fine: `engine2` will not do completion and `engine1` is not affected by `engine2`. However, `engine2` will load `model_2` onto the WebGPU device, hence creating more burden for the hardware than just "downloading a model". So the code above may fail if `model_1` and `model_2` together exceed the VRAM that the device has.
Therefore, one way to "only download a model, without touching WebGPU" is:

On the tvmjs side:
- Add a flag `onlyDownload` to `fetchNDArrayCache()` (see the sketch after this list):
  https://github.com/apache/tvm/blob/1fcb62023f0a5f878abd5b43ec9e547933fb5fab/web/src/runtime.ts#L1450-L1456
- The flag then makes it return right before loading onto WebGPU (notice the `device` is never used before this line):
  https://github.com/apache/tvm/blob/1fcb62023f0a5f878abd5b43ec9e547933fb5fab/web/src/runtime.ts#L1566
- Or create another API parallel to `fetchNDArrayCache()`, and the two APIs can reuse the same code for downloading the model.

On the webllm side:
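Purely as an illustration of the proposed control flow (none of the names below are the actual tvmjs API; the two phases are stubbed out as callbacks), the flag would amount to an early return between the download phase and the GPU-load phase:

```ts
// Hypothetical sketch of the proposed control flow, not the real
// fetchNDArrayCache() signature or body.
async function fetchNDArrayCacheSketch(
  downloadAllShards: () => Promise<void>, // network + Cache Storage only
  loadShardsToGPU: () => Promise<void>,   // the part that touches the device
  onlyDownload = false,                   // the proposed flag
): Promise<void> {
  await downloadAllShards();
  // Return before the WebGPU device is ever used.
  if (onlyDownload) return;
  await loadShardsToGPU();
}
```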
I ended up coding a custom function that manually loads the files into the cache. I didn't expect separating the downloading from the inference to have such a big effect, but it has helped simplify my code. It's now possible for users to load and use models they have already downloaded while they wait for a new one to finish downloading.
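For illustration, here is a rough sketch of such a prefetch helper, in the spirit of the Wllama helper mentioned earlier. This is not WebLLM's API: the manifest name `ndarray-cache.json`, its `records[].dataPath` field, and the cache name are assumptions about the MLC model layout, so verify them against what WebLLM actually stores in your browser's Cache Storage:

```ts
// Hypothetical prefetch helper; not part of WebLLM's API.
// Downloads a model's weight shards into Cache Storage and stops there,
// without ever touching WebGPU.
async function prefetchModelFiles(
  modelBaseUrl: string,                 // base URL of the model repo, ending in "/"
  cacheName = "webllm/model",           // assumed cache name; check in DevTools
  onProgress?: (done: number, total: number) => void,
): Promise<void> {
  const cache = await caches.open(cacheName);

  // The manifest lists every parameter shard (assumed layout).
  const manifestUrl = new URL("ndarray-cache.json", modelBaseUrl).href;
  const manifest = await (await fetch(manifestUrl)).json();
  const shardUrls: string[] = manifest.records.map(
    (r: { dataPath: string }) => new URL(r.dataPath, modelBaseUrl).href,
  );

  let done = 0;
  for (const url of [manifestUrl, ...shardUrls]) {
    // cache.add() fetches the URL and stores the response.
    if (!(await cache.match(url))) {
      await cache.add(url);
    }
    onProgress?.(++done, shardUrls.length + 1);
  }
}
```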
Perhaps related to this PR, but opposite:
I'd like to be able to easily ask WebLLM to download a second (or third, etc) model to cache, while continuing to use the existing, already loaded model. Then get a callback when the second model has loaded, so that I can inform the user they can now switch to the other model if they prefer.
Or is there an optimal way to do this already?
Currently my idea is to create a separate function that loads the new shards into the cache manually, outside of WebLLM. But I'd prefer to use WebLLM for this if such a feature already exists (I searched the repo but couldn't find one).
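One possible stop-gap within the current API, sketched under the assumption that `engine.unload()` releases the WebGPU resources while the downloaded weights remain in the browser cache; note that, per the discussion above, this still loads the model onto the GPU briefly, so the VRAM caveat applies:

```ts
import { MLCEngine } from "@mlc-ai/web-llm";

// Download a model's artifacts by loading it on a throwaway engine and
// unloading it right away, then notify the caller that it is cached.
async function downloadModelInBackground(
  modelId: string,
  onReady: (modelId: string) => void,
): Promise<void> {
  const tempEngine = new MLCEngine();
  await tempEngine.reload(modelId); // downloads + loads onto WebGPU
  await tempEngine.unload();        // frees the WebGPU memory again
  onReady(modelId);                 // e.g. tell the user they can switch now
}
```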