mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

GPU memory utilization error: I have seen the same error on 12 GB and 8 GB RAM devices for Llama 3.1 and Gemma 9B with q4f16_1 quantization #2751

Open Vinaysukhesh98 opened 1 month ago

Vinaysukhesh98 commented 1 month ago

mlc-llm/cpp/serve/threaded_engine.cc:283: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).

1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.

2024-08-05 18:08:52.206 12017-12060 AndroidRuntime ai.mlc.mlcchat E FATAL EXCEPTION: Thread-8
Process: ai.mlc.mlcchat, PID: 12017
org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization.
3. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.
Stack trace:
  File "/Downloads/mlc-llm/cpp/serve/threaded_engine.cc", line 283
    at org.apache.tvm.Base.checkCall(Base.java:173)
    at org.apache.tvm.Function.invoke(Function.java:130)
    at ai.mlc.mlcllm.JSONFFIEngine.runBackgroundLoop(JSONFFIEngine.java:64)
    at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:42)
    at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:40)
    at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:19)
    at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:18)
    at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
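A side note on the numbers in this log: the model weights alone (4958.468 MB) already exceed the 4762.535 MB the runtime reports as available, so shrinking the temporary buffer with a smaller --prefill-chunk-size cannot fix this particular failure by itself. A minimal sketch of the arithmetic, using only the figures printed in the error above:

```python
# Back-of-the-envelope check using only the figures printed in the error message.
available_mb = 4762.535  # single-GPU memory the runtime reports as usable
weights_mb = 4958.468    # q4f16_1 model weight size from the log
temp_buf_mb = 609.312    # temporary buffer size from the log

print(f"weights alone overshoot by {weights_mb - available_mb:.1f} MB")
print(f"weights + temp buffer overshoot by {weights_mb + temp_buf_mb - available_mb:.1f} MB")

# Even with an arbitrarily small temporary buffer, the weights do not fit, so on
# this device the realistic options are a smaller model, a stronger quantization,
# or exposing more GPU memory (e.g. a larger "gpu_memory_utilization", if the
# driver actually has headroom).
```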
Hzfengsy commented 1 month ago

Not sure, but some Android vendor OSes only provide limited memory to the GPU rather than all of the DRAM.

Vinaysukhesh98 commented 1 month ago

@Hzfengsy If enough GPU memory is not available, can we offload part of the load to the CPU?

shifeiwen commented 1 month ago

On Android devices, the GPU memory available to OpenCL is usually not all of the phone's DRAM. You can try to split the model and compile it in pieces. I tried the CPU runtime but I couldn't optimize it well.
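To see how much global memory the OpenCL driver actually exposes (as opposed to the phone's total DRAM), one option is to query CL_DEVICE_GLOBAL_MEM_SIZE. Here is a small sketch using pyopencl, run wherever a Python + OpenCL environment is available; the use of pyopencl here is an assumption for illustration, not part of MLC LLM:

```python
# Hypothetical check: list every OpenCL device and the global memory it reports
# (CL_DEVICE_GLOBAL_MEM_SIZE). On many Android GPUs this figure is noticeably
# smaller than the phone's total DRAM, which matches the limit hit in the log
# above. Requires `pip install pyopencl`.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        mem_mb = dev.global_mem_size / (1024 * 1024)
        print(f"{platform.name} / {dev.name}: {mem_mb:.0f} MB global memory")
```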