Closed: tobrun closed this issue 9 months ago
This is interesting. I think one takeaway is that we should also run the initialization of the chat module (i.e. chat.reload) on a separate thread, just like the thread we use for decode.
Note that on iOS this is achieved with a ThreadWorker; it would be great to check what the situation is on Android.
See https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/States/ChatState.swift#L325 for how we wrap things in threadWorker.push.
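For illustration, here is a minimal Kotlin sketch of what an Android counterpart to the iOS ThreadWorker could look like: a single-threaded executor that serializes every call into the chat module, so reload and decode never block the UI thread. `ChatModule` and its method names are placeholders, not the actual Android binding.

```kotlin
import java.util.concurrent.Executors

// Hypothetical stand-in for the real JNI chat module binding.
interface ChatModule {
    fun reload(modelPath: String)
    fun decode()
}

// Mirrors the iOS ThreadWorker: all chat calls are pushed onto one
// dedicated background thread. A single-threaded executor also keeps
// calls ordered, since the native chat state is not thread-safe.
class ChatWorker(private val chat: ChatModule) {
    private val executor = Executors.newSingleThreadExecutor()

    fun push(task: () -> Unit) {
        executor.execute(task)
    }

    fun reload(modelPath: String) = push {
        chat.reload(modelPath) // heavy native call, now off the UI thread
    }

    fun decode() = push {
        chat.decode()
    }
}
```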
I'm planning to debug this further while looking into https://github.com/mlc-ai/mlc-llm/issues/1295. The problem is that at the moment I'm unable to debug, because there is no native build system integration with Gradle, which makes it impossible to debug the C++ code directly from Android Studio.
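For reference, a minimal `build.gradle.kts` sketch of what that integration could look like, so Android Studio can build the C++ sources and attach its native debugger; the CMake path below is an assumption:

```kotlin
// Module-level build.gradle.kts (sketch). Pointing externalNativeBuild at
// the project's CMakeLists.txt lets Android Studio compile the C++ code
// and resolve breakpoints in it.
android {
    externalNativeBuild {
        cmake {
            path = file("src/main/cpp/CMakeLists.txt") // assumed location
        }
    }
    buildTypes {
        getByName("debug") {
            isJniDebuggable = true // keep native debugging enabled
        }
    }
}
```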
The initial loading doesn't seem to be an issue anymore, but prompting with large prompt sizes is. Following up in #1401. Closing this one.
🐛 Bug
I'm noticing this with both the default Llama-2-7b and TinyLlama-1.1b. When loading the model for the first time, or when prompting it for the first time, the process halts and the UI freezes. It takes a couple of seconds before the UI thread becomes responsive again.
If you have multiple workers in your application besides the model loading, the Android OS will kill the application with an ANR (Application Not Responding) error. This type of ANR happens when you perform too much work on the main thread, so I'm assuming we are missing asynchronous loading of the model weights into memory, or asynchronous execution when we prompt the model.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The model weights are loaded asynchronously into memory, and interacting with the model is also executed on a background thread.
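As a sketch of that expected behavior, assuming Kotlin coroutines are available in the app (`loadModelWeights` and `showReadyState` are hypothetical placeholders, not real mlc-llm APIs):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

class LoadedModel // placeholder type

// Hypothetical blocking loader and UI callback, for illustration only.
fun loadModelWeights(path: String): LoadedModel = LoadedModel()
fun showReadyState(model: LoadedModel) { /* update UI */ }

// Load the weights on the IO dispatcher and only touch the UI once the
// heavy work is done, keeping the main thread responsive throughout.
fun loadModelAsync(scope: CoroutineScope, modelPath: String) {
    scope.launch(Dispatchers.IO) {
        val model = loadModelWeights(modelPath) // blocking I/O + JNI work
        withContext(Dispatchers.Main) {
            showReadyState(model) // UI update back on the main thread
        }
    }
}
```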
Environment
- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): NA