Open Vinaysukhesh98 opened 1 month ago
Not sure, but some Android vendor OSes only expose a limited amount of memory to the GPU, not all of the DRAM.
@Hzfengsy If enough GPU memory is not available, can we offload part of the load to the CPU?
On Android devices, the GPU memory available to OpenCL is usually not all of the phone's DRAM. You can try splitting the model and compiling it in pieces. I tried the CPU runtime, but I couldn't optimize it well.
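If it helps, here is a hedged sketch of making the model fit in the limited GPU memory: regenerating the config with 4-bit quantization (to shrink the weights) and a smaller prefill chunk (to shrink the temporary buffer). The model path, conversation template, and output directory below are placeholders, assuming the standard `mlc_llm gen_config` CLI:

```shell
# Sketch, not a verified recipe: paths and the conv template are
# placeholders for your own model.
mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf \
  --quantization q4f16_1 \
  --prefill-chunk-size 256 \
  --conv-template llama-2 \
  -o ./dist/llama-2-7b-q4f16_1-MLC
```

Lowering `--prefill-chunk-size` trades prefill throughput for a smaller temporary buffer, which is usually the right trade on memory-constrained phones.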
2024-08-05 18:08:52.206 12017-12060 AndroidRuntime ai.mlc.mlcchat E FATAL EXCEPTION: Thread-8
Process: ai.mlc.mlcchat, PID: 12017
org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4762.535 MB, which is less than the sum of model weight size (4958.468 MB) and temporary buffer size (609.312 MB). If the model weight size is too large, please enable tensor parallelism by passing --tensor-parallel-shards $NGPU to mlc_llm gen_config or use quantization. If the temporary buffer size is too large, please use a smaller --prefill-chunk-size in mlc_llm gen_config.
Stack trace:
  File "/Downloads/mlc-llm/cpp/serve/threaded_engine.cc", line 283
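For reference, the engine's check in the log above is a plain sum-and-compare: weights plus temporary buffer must fit in the reported single-GPU memory. Redoing the arithmetic with the logged numbers shows the device is short by roughly 805 MB, so a smaller prefill chunk alone (about 609 MB of buffer) cannot close the gap and quantizing the weights is needed:

```shell
# Recompute the memory check from the log: required = weights + temp buffer.
awk 'BEGIN {
  available = 4762.535   # single-GPU memory reported by the runtime (MB)
  weights   = 4958.468   # model weight size (MB)
  temp      = 609.312    # temporary buffer size (MB)
  required  = weights + temp
  printf "required %.3f MB, short by %.3f MB\n", required, required - available
}'
# → required 5567.780 MB, short by 805.245 MB
```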