Closed: tobrun closed this issue 9 months ago
This is interesting. I think one takeaway is that we should also run the initialization of the chat module (i.e. chat.reload) on a separate thread, just like the thread we use for decode.
Note that on iOS this is achieved with a ThreadWorker; it would be great to check what the situation is on Android.
See https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/States/ChatState.swift#L325 for how we wrap things in threadWorker.push.
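For illustration, here is a minimal Kotlin sketch of what an Android counterpart to the iOS ThreadWorker could look like: a single-threaded executor that serializes every call into the chat module, so reload and decode never block the UI thread. `ChatModule` and its method names are placeholders, not the actual Android binding.

```kotlin
import java.util.concurrent.Executors

// Hypothetical stand-in for the real JNI chat module binding.
interface ChatModule {
    fun reload(modelPath: String)
    fun decode()
}

// Mirrors the iOS ThreadWorker: all chat calls are pushed onto one
// dedicated background thread. A single-threaded executor also keeps
// calls ordered, since the native chat state is not thread-safe.
class ChatWorker(private val chat: ChatModule) {
    private val executor = Executors.newSingleThreadExecutor()

    fun push(task: () -> Unit) {
        executor.execute(task)
    }

    fun reload(modelPath: String) = push {
        chat.reload(modelPath) // heavy native call, now off the UI thread
    }

    fun decode() = push {
        chat.decode()
    }
}
```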
I'm planning to debug this further while looking into https://github.com/mlc-ai/mlc-llm/issues/1295. The problem is that at the moment I'm unable to debug, because there is no native build system integration with Gradle, which makes it impossible to debug the C++ code directly from Android Studio.
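For reference, a minimal `build.gradle.kts` sketch of what that integration could look like, so Android Studio can build the C++ sources and attach its native debugger; the CMake path below is an assumption:

```kotlin
// Module-level build.gradle.kts (sketch). Pointing externalNativeBuild at
// the project's CMakeLists.txt lets Android Studio compile the C++ code
// and resolve breakpoints in it.
android {
    externalNativeBuild {
        cmake {
            path = file("src/main/cpp/CMakeLists.txt") // assumed location
        }
    }
    buildTypes {
        getByName("debug") {
            isJniDebuggable = true // keep native debugging enabled
        }
    }
}
```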
The initial loading doesn't seem to be an issue anymore, but prompting with large prompt sizes is. Following up in #1401. Closing this one.
🐛 Bug
I'm noticing this with both the default Llama-2-7b and TinyLlama-1.1b. When loading the model for the first time, or when prompting it for the first time, the process halts and the UI freezes. It takes a couple of seconds before the UI thread becomes responsive again.
If you have multiple workers in your application besides the model loading, the Android OS will kill the application with an ANR (Application Not Responding) error. This type of ANR happens when you perform too much work on the main thread, so I'm assuming we are missing asynchronous loading of the model weights into memory, or asynchronous execution when we prompt the model.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The model weights are loaded asynchronously into memory, and interacting with the model is also executed on a background thread.
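As a sketch of that expected behavior, assuming Kotlin coroutines are available in the app (`loadModelWeights` and `showReadyState` are hypothetical placeholders, not real mlc-llm APIs):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

class LoadedModel // placeholder type

// Hypothetical blocking loader and UI callback, for illustration only.
fun loadModelWeights(path: String): LoadedModel = LoadedModel()
fun showReadyState(model: LoadedModel) { /* update UI */ }

// Load the weights on the IO dispatcher and only touch the UI once the
// heavy work is done, keeping the main thread responsive throughout.
fun loadModelAsync(scope: CoroutineScope, modelPath: String) {
    scope.launch(Dispatchers.IO) {
        val model = loadModelWeights(modelPath) // blocking I/O + JNI work
        withContext(Dispatchers.Main) {
            showReadyState(model) // UI update back on the main thread
        }
    }
}
```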
Environment
- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): NA