ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

Large models fail to load from cache on iOS browsers, but load and run fine when uncached #72

Closed: felladrin closed this issue 4 months ago

felladrin commented 5 months ago

I've noticed that iOS browsers can load and run https://huggingface.co/Felladrin/gguf-sharded-Qwen1.5-0.5B-Chat_llamafy/resolve/main/Qwen1.5-0.5B-Chat_llamafy.Q3_K_M.shard-00001-of-00003.gguf (The sum of the three parts is 350MB). It always works the first time the model is loaded (to confirm this, I've loaded and run the model several times in sequence in incognito mode).

The problem is that the model can't run on iOS whenever it's loaded from the cache. It throws an Out Of Memory error 100% of the time.

Based on this, I started to think it's caused by a Promise.all() trying to load everything into memory at once. (Note: when the file is uncached, it is downloaded and loaded into memory gradually, which [I guess] prevents the issue from happening.)
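
To illustrate the difference I mean, here is a minimal sketch of the two loading strategies (illustrative only, not the actual wllama code):

```ts
// Illustrative sketch only, not the actual wllama code.
async function fetchAsBlob(url: string): Promise<Blob> {
  const res = await fetch(url);
  return res.blob();
}

// Promise.all: all shards are requested/materialized concurrently,
// so peak memory usage approaches the sum of all shards.
async function loadAllAtOnce(urls: string[]): Promise<Blob[]> {
  return Promise.all(urls.map((url) => fetchAsBlob(url)));
}

// Sequential: each shard is only loaded after the previous one finished,
// so only one shard is being materialized at any moment.
async function loadSequentially(urls: string[]): Promise<Blob[]> {
  const blobs: Blob[] = [];
  for (const url of urls) {
    blobs.push(await fetchAsBlob(url));
  }
  return blobs;
}
```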

And, checking the code, there are only two files using Promise.all():

https://github.com/ngxson/wllama/blob/a5e919bc55e59408fc90c4d5ebc701f76329bc67/src/downloader/multi-downloads.ts#L45-L65

https://github.com/ngxson/wllama/blob/a5e919bc55e59408fc90c4d5ebc701f76329bc67/src/worker.ts#L437-L440

I'm also short on time but I'll keep you posted if I find something new.

PS: Add-ons loaded in the mobile browser (e.g. Grammarly) can reduce the memory available to load/run the model, so the tests above were done with all extensions disabled.

felladrin commented 5 months ago

Unfortunately, replacing the Promise.all with a for await...of loop didn't solve the issue.

In any case, I'm leaving here the diff of what I tried, for the record:

```diff
diff --git a/src/downloader/multi-downloads.ts b/src/downloader/multi-downloads.ts
index a27edfe..f24a426 100644
--- a/src/downloader/multi-downloads.ts
+++ b/src/downloader/multi-downloads.ts
@@ -44,7 +44,7 @@ export class MultiDownloads {

   async run(): Promise<Blob[]> {
     // create all Blobs
-    await Promise.all(this.tasks.map(async (task) => {
+    for await (const task of this.tasks) {
       task.blob = await GGUFRemoteBlob.create(task.url, {
         logger: this.logger,
         useCache: this.useCache,
@@ -54,7 +54,7 @@ export class MultiDownloads {
           this.updateProgress(task);
         },
       });
-    }));
+    }
     // calculate totalBytes
     this.totalBytes = this.tasks.reduce((n, task) => n + task.blob.size, 0);
     // run N dispatchers
diff --git a/src/worker.ts b/src/worker.ts
index d9a6c72..e840cc5 100644
--- a/src/worker.ts
+++ b/src/worker.ts
@@ -428,17 +428,11 @@ export class ProxyToWorker {
     });

     // allocate all files
-    const nativeFiles: ({ id: number } & typeof ggufFiles[number])[] = [];
     for (const file of ggufFiles) {
       const id = await this.fileAlloc(file.name, file.blob.size);
-      nativeFiles.push({ id, ...file });
+      await this.fileWrite(id, file.blob);
     }

-    // stream files
-    await Promise.all(nativeFiles.map(file => {
-      return this.fileWrite(file.id, file.blob);
-    }));
-
     return res;
   }
```
ngxson commented 5 months ago

MultiDownloads limits the number of files being loaded at the same time to maxParallel, so I don't see any problem with the Promise.all. The file is only read when the dispatcher() function is called.

https://github.com/ngxson/wllama/blob/a5e919bc55e59408fc90c4d5ebc701f76329bc67/src/downloader/multi-downloads.ts#L61C30-L61C41
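
The pattern is roughly this (a simplified sketch for illustration, not the actual MultiDownloads code): maxParallel dispatchers pull tasks from a shared queue, so at most maxParallel files are being read at any moment.

```ts
// Simplified sketch of the dispatcher pattern (not the actual MultiDownloads code).
async function runWithDispatchers<T>(
  tasks: (() => Promise<T>)[],
  maxParallel: number
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;
  // Each dispatcher keeps pulling the next pending task until the queue is empty.
  const dispatcher = async () => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };
  // Only maxParallel dispatchers run concurrently, so at most maxParallel
  // tasks (file reads) are in flight at the same time.
  await Promise.all(Array.from({ length: maxParallel }, () => dispatcher()));
  return results;
}
```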

What I suspect is that OPFS on iOS may always read files in big chunks, so they don't fit in memory. Could you try with a smaller maxParallel?

felladrin commented 5 months ago

Ah, yes, I forgot to say that I've been forcing both config.parallelDownloads and config.n_threads to 1 on iOS, as that solves the issue of unstable downloads there.

So the tests above were all using maxParallel == 1.
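
For reference, the forcing looks roughly like this (option names are the ones mentioned in this thread; the exact shape of the wllama config may differ between versions, and the model URL is a placeholder):

```ts
import { Wllama } from '@wllama/wllama';

// Placeholders for illustration only:
declare const wllama: Wllama; // an already-constructed Wllama instance
const MODEL_URL = 'https://example.com/model.shard-00001-of-00003.gguf';

const isIOS = /iPad|iPhone|iPod/.test(navigator.userAgent);

await wllama.loadModelFromUrl(MODEL_URL, {
  parallelDownloads: isIOS ? 1 : 3, // maxParallel == 1 on iOS to avoid unstable downloads
  n_threads: isIOS ? 1 : undefined, // single-threaded on iOS
});
```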

> OPFS on iOS may always read files in big chunks, so they don't fit in memory

I agree that this is likely the problem. But currently it's only an issue when loading files from the cache.
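
If that's the case, one direction worth exploring (purely hypothetical, not something I've tested) would be streaming the cached OPFS file in small pieces instead of materializing the whole Blob at once, roughly:

```ts
// Hypothetical sketch: read a cached OPFS file chunk by chunk via its stream,
// processing each chunk (e.g. copying it into the wasm filesystem) and then
// dropping it, instead of holding the whole file in memory at once.
// The file name and callback are placeholders for illustration.
async function forEachChunkOfCachedFile(
  fileName: string,
  onChunk: (chunk: Uint8Array) => Promise<void>
): Promise<void> {
  const opfsRoot = await navigator.storage.getDirectory();
  const handle = await opfsRoot.getFileHandle(fileName);
  const file = await handle.getFile();

  const reader = file.stream().getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    await onChunk(value); // each chunk is a small Uint8Array, not the whole file
  }
}
```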

ngxson commented 5 months ago

Another idea: can you split the model into smaller parts? For example, instead of --split-max-size, use --split-max-tensors 1 so each file contains a single tensor.

felladrin commented 4 months ago

Thanks for the suggestion, @ngxson!

I've split Qwen2-0.5B-Instruct-llamafy into the minimum possible number of tensors per file (5) and tested it. In total, there were 63 parts, totaling 339MB, with Q2_K (https://huggingface.co/Felladrin/gguf-sharded-Qwen2-0.5B-Instruct-llamafy/resolve/main/Qwen2-0.5B-Instruct-llamafy.Q2_K.shard-00001-of-00063.gguf). But even so, it fails, unfortunately.

I also considered that the problem could be the creation of several new Workers when running on Safari: https://github.com/ngxson/wllama/blob/896c160164156ba23291ba122cc44a9cd04ada52/src/cache-manager.ts#L183-L189

But it's just speculation.

As this is a very specific Safari-mobile memory problem, and to avoid spending more of our time on it, I'll close this issue. Besides, I can run https://huggingface.co/Felladrin/gguf-sharded-Llama-160M-Chat-v1/resolve/main/Llama-160M-Chat-v1.Q6_K.shard-00001-of-00006.gguf perfectly fine on an iPhone ✌️