I have modified and crafted some code to run an LLM in an adb shell or Linux shell via MLC-LLM (big thanks to the authors and contributors, by the way) as a standalone binary executable.
I'm not a C++ expert, so the code isn't perfect (actually it is the tinkered and glued-together output of ChatGPT, Claude and my dog), but I think it's easy to read, understand and run.
How to set up:
Set up MLC-LLM and a virtualenv (install dependencies, TVM, etc.).
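For reference, a minimal sketch of that step, assuming the nightly pip packages from the MLC docs (the package names and wheel index are assumptions; check the official install page for your platform):

```sh
# create and activate a virtualenv (name is arbitrary)
python3 -m venv mlc-venv
source mlc-venv/bin/activate
# install the MLC-LLM + TVM nightly wheels; verify the exact
# package variants for your OS/GPU against the MLC install docs
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
```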
Create a build directory, e.g. `build-aarch64-opencl`. Run all of the following commands from this directory.
Build: `make -j 8`. Now you should have `libmlc_llm_module.so`, `tvm/libtvm.so` and `llm_benchmark`.
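The configure step for the cross build might look roughly like this (toolchain path, ABI and flags are assumptions for an aarch64/OpenCL target; adapt them to your NDK and the project's CMakeLists):

```sh
# from the build-aarch64-opencl directory; $ANDROID_NDK must point at your NDK install
cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-31 \
  -DUSE_OPENCL=ON
make -j 8
```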
Do the normal steps you would do to build an MLC-LLM model. (I used GPT-2, because its unmodified version fits in my phone's RAM. You may need to modify paths.)
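For example, the config-generation step from my GPT-2 medium setup:

```sh
mlc_llm gen_config \
  --quantization q0f16 \
  --max-batch-size 1 \
  --conv-template gpt2 \
  -o ./gpt2-medium-q0f16 \
  gpt2-medium/
```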
5. Build the model library (without `$CC` I was getting an error about a missing compiler; I think the Android API version doesn't matter that much; you may want to somehow modify the predefined device, but that's a story for new books):
```sh
CC=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android31-clang++ \
TVM_NDK_CC=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android31-clang++ \
mlc_llm compile ./gpt2-medium-q0f16/mlc-chat-config.json \
--device android:adreno-so \
--host aarch64-linux-android \
-o gpt2-medium-q0f16-opencl-aarch64.so
```

Let's rock. Inside `adb shell`, run the following commands:

```sh
cd /data/local/tmp/mlc/gpt2-medium-aarch64-opencl/
LD_LIBRARY_PATH=. ./llm_benchmark \
  ./gpt2-medium-q0f16 \
  ./gpt2-medium-q0f16-opencl-aarch64.so \
  "local" 4 60 250 \
  "Give me short answer who you are?" 3
```

Arguments:
1. folder with the weights
2. model library file
3. execution mode (`server` and `interactive` are the alternatives)
4. device type: `4` means OpenCL (alternatives are described in the source code)
5. execution timeout in seconds
6. max tokens
7. the prompt
8. number of executions (if it is 1, the generated text is printed)
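Getting the files onto the device isn't shown above; here is a sketch of the push step, assuming the directory layout from the run command (exact paths and filenames are taken from my setup above, adjust as needed):

```sh
# create the target dir, then push the binary, runtime libs, model library and weights
adb shell mkdir -p /data/local/tmp/mlc/gpt2-medium-aarch64-opencl/
adb push llm_benchmark libmlc_llm_module.so tvm/libtvm.so \
    /data/local/tmp/mlc/gpt2-medium-aarch64-opencl/
adb push gpt2-medium-q0f16 gpt2-medium-q0f16-opencl-aarch64.so \
    /data/local/tmp/mlc/gpt2-medium-aarch64-opencl/
adb shell chmod +x /data/local/tmp/mlc/gpt2-medium-aarch64-opencl/llm_benchmark
```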
If you would like to run on a local computer, remove the cross-compilation directives from the CMake files and adjust the paths to suit your setup.
I'm afraid this cannot be merged as-is, because it modifies some important files, like `openai_format.cc`.