mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

iOS app crashes when using Llama-2 model [Bug] #858

Closed tstanek390 closed 1 year ago

tstanek390 commented 1 year ago

🐛 Bug

iOS MLC Chat app crashes when trying to use downloaded custom Llama2 model.

To Reproduce

Steps to reproduce the behavior:

  1. Compiling the model with TVM Unity and the recommended procedure for iOS devices.
  2. Uploading the model to Huggingface.
  3. Downloading the model to iOS app on iPhone.
  4. Running the app with downloaded model.
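The compile step above can be sketched roughly as follows. This is a hedged illustration only: the model path and quantization name are placeholders, and the exact flags assume the `build.py` workflow documented for mlc-llm around this time, not commands taken from the report.

```shell
# Clone mlc-llm with its submodules (TVM Unity is pulled in recursively).
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm

# Compile a Llama-2-based model for iOS (Metal target).
# /path/to/MedLLama2 and q4f16_1 are illustrative placeholders.
python3 build.py \
  --model /path/to/MedLLama2 \
  --target iphone \
  --quantization q4f16_1
```

The resulting artifacts would then be uploaded to Hugging Face and fetched by the MLC Chat app, as in steps 2-4.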

Expected behavior

Using the custom model in iOS app.

Environment

link to Huggingface TVM compiled model : https://huggingface.co/tstanek390/MedLLama2iOS

Thx for any kind of help, T.

Hzfengsy commented 1 year ago

It may be due to memory limitations, i.e. there is not enough memory on your device. Please check if RedPajama can run successfully.

tstanek390 commented 1 year ago

RedPajama runs successfully, as does Llama-2-7B-chat-hf. I have already tried a dozen models that I quantized and processed myself with MLC LLM and its recommended parameters and settings, with the same or even more aggressive quantizations, but the app always crashes. All the models I'm trying to use are based on Llama-2-7b. Any ideas what could cause the issue? :(

tqchen commented 1 year ago

Likely you need to limit the seq len and use q3 quantization for Llama-2 models
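The suggested mitigation could look something like the sketch below, assuming `build.py` accepts `--quantization` and `--max-seq-len` flags as in the mlc-llm documentation of this period; the paths and the value 512 are illustrative, not prescribed by the thread.

```shell
# Rebuild with 3-bit quantization and a capped context length,
# which shrinks both the weights and the KV cache in memory.
# q3f16_1 is a typical 3-bit quantization name; 512 is an example cap.
python3 build.py \
  --model /path/to/Llama-2-7b-chat-hf \
  --target iphone \
  --quantization q3f16_1 \
  --max-seq-len 512
```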

baiyutang commented 10 months ago

> Likely you need to limit the seq len and use q3 quantization for Llama-2 models

Regarding "limit the seq len": is there a recommended --max-seq-len value for Llama-2? 512?