Closed: felladrin closed this issue 4 months ago.
Unfortunately, replacing the `Promise.all` with a `for await...of` loop didn't solve the issue.
In any case, I'm leaving the diff of what I tried here, for the record:
```diff
diff --git a/src/downloader/multi-downloads.ts b/src/downloader/multi-downloads.ts
index a27edfe..f24a426 100644
--- a/src/downloader/multi-downloads.ts
+++ b/src/downloader/multi-downloads.ts
@@ -44,7 +44,7 @@ export class MultiDownloads {
   async run(): Promise<Blob[]> {
     // create all Blobs
-    await Promise.all(this.tasks.map(async (task) => {
+    for await (const task of this.tasks) {
       task.blob = await GGUFRemoteBlob.create(task.url, {
         logger: this.logger,
         useCache: this.useCache,
@@ -54,7 +54,7 @@ export class MultiDownloads {
           this.updateProgress(task);
         },
       });
-    }));
+    }
     // calculate totalBytes
     this.totalBytes = this.tasks.reduce((n, task) => n + task.blob.size, 0);
     // run N dispatchers
diff --git a/src/worker.ts b/src/worker.ts
index d9a6c72..e840cc5 100644
--- a/src/worker.ts
+++ b/src/worker.ts
@@ -428,17 +428,11 @@ export class ProxyToWorker {
     });
     // allocate all files
-    const nativeFiles: ({ id: number } & typeof ggufFiles[number])[] = [];
     for (const file of ggufFiles) {
       const id = await this.fileAlloc(file.name, file.blob.size);
-      nativeFiles.push({ id, ...file });
+      await this.fileWrite(id, file.blob);
     }
-    // stream files
-    await Promise.all(nativeFiles.map(file => {
-      return this.fileWrite(file.id, file.blob);
-    }));
-
     return res;
   }
```
`MultiDownloads` will limit to `maxParallel` files being loaded at the same time, so I don't see any problem with the `Promise.all`. The file is only read when the `dispatcher()` function is called.
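For readers following along, here is a minimal sketch (not the actual wllama code) of the dispatcher pattern described above, where `maxParallel` dispatchers pull tasks from a shared queue, so at most `maxParallel` files are in flight at once:

```ts
// Minimal sketch of the dispatcher pattern: N dispatchers drain a task
// queue, bounding concurrency at `maxParallel`.
async function runTasks<T>(
  tasks: (() => Promise<T>)[],
  maxParallel: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  // each dispatcher claims and runs tasks until the queue is empty
  const dispatcher = async () => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };

  // Promise.all here only awaits the dispatchers themselves, so at most
  // `maxParallel` tasks run concurrently
  await Promise.all(Array.from({ length: maxParallel }, () => dispatcher()));
  return results;
}
```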
What I suspect is that maybe OPFS on iOS always reads files in big chunks, so they don't fit in memory. Could you try with a smaller `maxParallel`?
Ah, yes, I forgot to say that I've been forcing both `config.parallelDownloads` and `config.n_threads` to `1` on iOS, as it solves the issue of unstable downloads on iOS. So the tests above were all using `maxParallel == 1`.
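For context, here is roughly what that forcing can look like; a minimal sketch assuming the `parallelDownloads` and `n_threads` config fields mentioned above, with the user-agent check and the non-iOS values being illustrative assumptions:

```ts
// Illustrative sketch; the config field names come from this thread,
// but the iOS detection heuristic and non-iOS values are assumptions.
const isIOS = /iPad|iPhone|iPod/.test(navigator.userAgent);

const config = {
  // force sequential downloads and a single thread on iOS, to work
  // around the unstable downloads mentioned above
  parallelDownloads: isIOS ? 1 : 3,
  n_threads: isIOS ? 1 : 4,
};
```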
> OPFS on iOS always reads files in big chunks, so they don't fit in memory

I agree that this is the problem. But currently it's only an issue when loading files from the cache.
Another idea: can you split the model into smaller parts? For example, instead of `--split-max-size`, use `--split-max-tensors 1` so that each file contains one single tensor.
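For reference, assuming these flags belong to llama.cpp's `gguf-split` tool, the invocation would look roughly like `gguf-split --split-max-tensors 1 model.gguf model-shard`; the binary name and positional arguments here are assumptions, so check the tool's `--help` output for the exact syntax.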
Thanks for the suggestion, @ngxson!
I've split Qwen2-0.5B-Instruct-llamafy to the minimum possible number of tensors per file (`5`) and tested it. In total, there were 63 parts, summing to 339 MB, with Q2_K (https://huggingface.co/Felladrin/gguf-sharded-Qwen2-0.5B-Instruct-llamafy/resolve/main/Qwen2-0.5B-Instruct-llamafy.Q2_K.shard-00001-of-00063.gguf). But even so, it fails, unfortunately.
I also considered that the problem could be that it's creating several new `Worker` instances when running on Safari:
https://github.com/ngxson/wllama/blob/896c160164156ba23291ba122cc44a9cd04ada52/src/cache-manager.ts#L183-L189
But it's just speculation.
As this is a very specific Safari-mobile memory problem, to avoid spending more of our time on it, I'll close this issue. Besides that, I can run https://huggingface.co/Felladrin/gguf-sharded-Llama-160M-Chat-v1/resolve/main/Llama-160M-Chat-v1.Q6_K.shard-00001-of-00006.gguf perfectly fine on iPhone ✌️
I've noticed that iOS browsers can load and run
https://huggingface.co/Felladrin/gguf-sharded-Qwen1.5-0.5B-Chat_llamafy/resolve/main/Qwen1.5-0.5B-Chat_llamafy.Q3_K_M.shard-00001-of-00003.gguf
(the sum of the three parts is 350 MB). It always works the first time the model is loaded (to confirm this, I've loaded and run the model several times in sequence in incognito mode). The problem is that the model can't run on iOS whenever it's loaded from the cache. It throws Out Of Memory 100% of the time.
Based on this, I started to think it's caused by a `Promise.all()` trying to load everything into memory at once. (Note: when the file is uncached, it downloads and loads the file into memory gradually, which [I guess] prevents the issue from happening.) And, checking the code, there are only two files using `Promise.all()`:
:https://github.com/ngxson/wllama/blob/a5e919bc55e59408fc90c4d5ebc701f76329bc67/src/downloader/multi-downloads.ts#L45-L65
https://github.com/ngxson/wllama/blob/a5e919bc55e59408fc90c4d5ebc701f76329bc67/src/worker.ts#L437-L440
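To illustrate what I mean, here is a hypothetical sketch (not the wllama code; `loadBlob` is a made-up stand-in for reading one cached shard) of the difference between eager and sequential loading:

```ts
// Hypothetical stand-in for reading one cached shard; not a wllama API.
declare function loadBlob(url: string): Promise<Blob>;

// Eager: all shard reads are started at once, so their buffers can be
// resident in memory at the same time.
async function loadAllAtOnce(urls: string[]): Promise<Blob[]> {
  return Promise.all(urls.map((url) => loadBlob(url)));
}

// Sequential: only one shard is being read at a time, which should keep
// peak memory during loading lower.
async function loadOneByOne(urls: string[]): Promise<Blob[]> {
  const blobs: Blob[] = [];
  for (const url of urls) {
    blobs.push(await loadBlob(url));
  }
  return blobs;
}
```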
I'm also short on time, but I'll keep you posted if I find something new.
PS: Add-ons loaded in the mobile browser (e.g., Grammarly) can reduce the memory available to load/run the model, so the tests above were done with all extensions disabled.