mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Llama3: How to solve GPU Out of Memory Error on Pixel 8 Pro? #2703

Open limin05030 opened 1 month ago

limin05030 commented 1 month ago

When I tried to run Llama-3-8B-Instruct on Android with 4-bit quantization (my device is a Pixel 8 Pro), I hit an insufficient GPU memory error. The specific information is as follows:

Process: ai.mlc.mlcchat, PID: 5790
    org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4352.000 MB, which is less than the sum of model weight size (4308.133 MB) and temporary buffer size (278.504 MB).
    1. You can set a larger "gpu_memory_utilization" value.
    2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
    3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
    Stack trace:
      File "/home/ll/MLC/Llama-3-8B-Instruct/mlc-llm/cpp/serve/threaded_engine.cc", line 283

    ...

Regarding the three solutions suggested in the crash log:

  1. I don't know how to modify gpu_memory_utilization. Can it be changed on Android at all?
  2. Tensor parallelism (`--tensor-parallel-shards`) is not supported on Android, so I skipped it.
  3. A smaller `--prefill-chunk-size` only shrinks the temporary buffer, but the model weight size alone already exceeds the available single-GPU memory, so this method is also ineffective (the command is sketched below anyway, for completeness).
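
For completeness, the command from suggestion 3 would look roughly like this. I am not sure about the exact paths, the `llama-3` conversation template name, or a good chunk size, so treat all of these as placeholders:

```bash
# Rough sketch only: regenerate the chat config with a smaller prefill chunk size.
# Paths, the conv-template name, and the value 256 are assumptions, not verified settings.
mlc_llm gen_config ./dist/models/Llama-3-8B-Instruct \
    --quantization q4f16_1 \
    --conv-template llama-3 \
    --prefill-chunk-size 256 \
    -o ./dist/Llama-3-8B-Instruct-q4f16_1-MLC
```

As noted above, this only shrinks the ~278 MB temporary buffer; the ~4308 MB of weights are untouched.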

So, is there any way to use 4-bit quantization to make Llama-3-8B-Instruct run properly on Android?

By the way, as far as I know, the CPU and GPU on Android share memory. If that is true, why is there a 4352.000 MB limit for a single GPU on the Pixel 8 Pro? (It has 12 GB of memory.)

My Pixel 5 reports the same limit. From the Pixel 5 to the 8 Pro, hasn't the amount of memory available to a single GPU increased at all?

Or is it because some parameters in the model limit the size of memory that a single GPU can use?

mengshyu commented 1 month ago

Hi @limin05030, can you try 3-bit quantization, e.g. q3f16_1?
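
Roughly, the conversion and config generation with q3f16_1 would look like the following; the paths and the `llama-3` conversation template name are placeholders to adapt to your own setup:

```bash
# Rough sketch only: convert the weights and regenerate the config with 3-bit quantization.
# Paths and the conv-template name are assumptions.
mlc_llm convert_weight ./dist/models/Llama-3-8B-Instruct \
    --quantization q3f16_1 \
    -o ./dist/Llama-3-8B-Instruct-q3f16_1-MLC
mlc_llm gen_config ./dist/models/Llama-3-8B-Instruct \
    --quantization q3f16_1 \
    --conv-template llama-3 \
    -o ./dist/Llama-3-8B-Instruct-q3f16_1-MLC
```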

limin05030 commented 1 month ago

@mengshyu 3-bit quantization does run, but it is particularly slow, with only about 0.7 tok/s prefill and 1.1 tok/s decode.

bkiefe commented 1 month ago

I have a Samsung Galaxy S24 Ultra with 12 GB of RAM and run into the same error. Does anyone have advice on what to do in that case?

limin05030 commented 1 month ago

It seems that this level of hardware configuration is not yet able to drive models of 8B and above. Currently, I have found that only models around 1.5B can run smoothly.

ponytaill commented 1 month ago

> It seems that this level of hardware configuration is not yet able to drive models of 8B and above. Currently, I have found that only models around 1.5B can run smoothly.

Hi @limin05030, I am also exploring the same issue. I have observed that on Android phones, a model of around 2B with 4-bit quantization runs relatively smoothly. However, a 7B model with 4-bit quantization freezes for more than ten seconds before responding, and the prefill and decode rates are both quite slow (0.x or 1.x tok/s). Even newer devices (such as the Snapdragon 8 Gen 2) have this problem. Have you reached the same conclusion? Can we say this is caused by hardware limitations?

martinkorelic commented 1 month ago

@limin05030 I was able to get past this check on my phone by altering this line of code in their library to something like `return 11LL * 1024 * 1024 * 1024;`, so it reports 11 GB of available RAM, which is enough to pass the initial check. I also observed that gpu_memory_utilization is set to somewhere around 0.85, but I have no idea whether it can be set manually through some config.

I deleted the previously compiled library from the cache, recompiled the whole library, and ran the application again. It no longer crashed at startup (the previous error was not displayed) and loaded the chat screen; however, it did crash later when it ran out of available RAM on my phone while loading the model weights.
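
Roughly, the rebuild went like this; the cache path and the use of `mlc_llm package` are assumptions based on a standard Android build, so adjust to however you built the app:

```bash
# Rough sketch only: clear cached model libraries and rebuild the Android app
# with the patched runtime. The cache path and exact commands are assumptions.
rm -rf ~/.cache/mlc_llm          # drop previously compiled model libraries
cd android/MLCChat
mlc_llm package                  # re-package model libraries and weights for the app
./gradlew assembleDebug          # rebuild the APK as usual
```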

I wonder whether the OpenCL library is not reading the correct amount of RAM, or is only reading the GPU device's own memory, because some phones have features like RAM Plus that add virtual RAM. Perhaps the model could use the full amount of available RAM if the initial check is bypassed.