limin05030 opened this issue 1 month ago
Hi @limin05030, can you try 3-bit quantization, e.g. q3f16_1?
@mengshyu 3-bit quantization works, but it is particularly slow, with only about 0.7 tok/s prefill and 1.1 tok/s decode.
I have a Samsung Galaxy S24 Ultra with 12 GB of RAM and run into the same error. Does anyone have any advice on what to do in that case?
It seems that this level of hardware configuration is not yet able to drive models of 8B and above. Currently, I have found that only models around 1.5B can run smoothly.
Hi @limin05030, I am also exploring the same issue. I have observed that on Android phones, a model of around 2B with 4-bit quantization runs relatively smoothly, but a 7B model with 4-bit quantization freezes for more than ten seconds before responding, and the prefill and decode rates are both quite slow (0.x or 1.x tok/s). Even newer devices (such as the Snapdragon 8 Gen 2) have this problem. Have you reached the same conclusion? Can we say this is caused by hardware limitations?
@limin05030
I was able to avoid this check on my phone by altering this line of code in their library to something like:
return 11LL * 1024 * 1024 * 1024;
so that it reports 11 GB of available RAM, which is enough to pass the initial check.
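For context, here is a minimal sketch of what that kind of override looks like. The function name and its surroundings are hypothetical (the actual check lives somewhere in the MLC/TVM runtime); only the hard-coded return value comes from the comment above.

```cpp
// Hypothetical sketch: hard-coding the reported device memory so the
// initial "enough GPU memory" pre-flight check passes. The function name
// and surrounding code are illustrative, not the actual MLC/TVM source.
#include <cstdint>

int64_t GetTotalGlobalMemory() {  // hypothetical name
  // The original code presumably queries the OpenCL driver here
  // (e.g. CL_DEVICE_GLOBAL_MEM_SIZE). Replacing it with a fixed 11 GiB
  // makes the pre-flight check stop rejecting the model.
  return 11LL * 1024 * 1024 * 1024;  // 11 GiB
}
```

As the rest of this comment shows, overriding the check only defers the problem: the real allocation can still fail later when the weights are loaded.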
I also observed that gpu_memory_utilization is set somewhere around 0.85, but I have no idea whether it is possible to set it manually through some config.
I deleted the previously compiled library from the cache, recompiled the whole library, and ran the application again. It no longer crashed at startup (the previous error was not displayed) and loaded the chat screen, but it did crash later while loading the model weights, as my phone ran out of available RAM.
I wonder if the OpenCL library is not reading the correct amount of RAM, or is only reading what is exposed to the GPU device, because some phones have features like RAM Plus which add virtual RAM. Perhaps the model could use the full amount of available RAM if the initial check is avoided.
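To check what the driver actually reports, a small standalone query along these lines can be run on the device. This is a sketch that assumes OpenCL headers and a libOpenCL to link against are available for the target (as they are on Adreno/Mali phones); error handling is omitted for brevity.

```cpp
// Print what the OpenCL driver reports for the first GPU device.
#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_ulong global_mem = 0, max_alloc = 0;
  clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                  sizeof(global_mem), &global_mem, nullptr);
  clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                  sizeof(max_alloc), &max_alloc, nullptr);

  // On unified-memory phones these numbers are often well below total RAM:
  // the driver reports only what it is willing to hand to the GPU, and
  // virtual-RAM features like RAM Plus are not visible here at all.
  printf("CL_DEVICE_GLOBAL_MEM_SIZE:    %llu MB\n",
         (unsigned long long)(global_mem >> 20));
  printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE: %llu MB\n",
         (unsigned long long)(max_alloc >> 20));
  return 0;
}
```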
When I tried to run Llama-3-8B-Instruct on Android using 4-bit quantization (my Android device is a Pixel 8 Pro), I encountered an insufficient-GPU-memory error. Regarding the three solutions suggested in the crash log:

1. gpu_memory_utilization: can this be modified for Android?
2. tensor-parallel-shards: not applicable on Android, so skip it.
3. prefill-chunk-size: this only reduces the size of the temporary buffer, but the model weights already exceed the available single-GPU memory, so this method is also ineffective.

So, is there any way to make Llama-3-8B-Instruct run properly on Android with 4-bit quantization?

By the way, as far as I know, the CPU and GPU on Android share memory. If that is true, why is there a memory limit of 4352.000 MB for a single GPU on the Pixel 8 Pro? (It has 12 GB of memory.)
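As a rough back-of-the-envelope check (my own numbers, not taken from the crash log or the MLC config), the weight footprint alone suggests why shrinking prefill-chunk-size cannot help: Llama-3-8B has roughly 8 billion parameters, and a 4-bit group-quantization scheme such as q4f16_1 also stores an fp16 scale per group, so the weights by themselves land right around the reported 4352 MB limit before the KV cache and temporary buffers are counted.

```cpp
// Back-of-the-envelope weight-size estimate for Llama-3-8B with 4-bit
// group quantization. The parameter count and group size are assumptions
// (~8B parameters, group size 32 with one fp16 scale per group), not
// values read from the actual MLC configuration.
#include <cstdio>

int main() {
  const double params = 8.0e9;          // ~8B weights (assumption)
  const double bits_per_weight = 4.0;   // 4-bit quantized values
  const double group_size = 32.0;       // weights per quantization group (assumption)
  const double scale_bits = 16.0;       // one fp16 scale per group

  const double effective_bits = bits_per_weight + scale_bits / group_size;  // 4.5 bits
  const double weight_mb = params * effective_bits / 8.0 / (1024.0 * 1024.0);

  // Roughly 4300 MB for the weights alone, before the KV cache and
  // temporary buffers, i.e. already at the ~4352 MB per-GPU limit
  // reported in the crash log.
  printf("Estimated weight footprint: %.0f MB\n", weight_mb);
  return 0;
}
```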