Open Laeglaur opened 11 months ago
Share your HW configuration.
I think this might be memory related. I am running TinyChatEngine on a Codespace with just 2 cores and 8 GB of RAM, and I get crashes.
TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT:
Hello! How can I help you today? Is there something specific you would like to know or talk about?
Inference latency, Total time: 8.3 s, 360.9 ms/token, 2.8 token/s, 23 tokens
USER: What do you know about the Tensor Virtual Machine framework?
ASSISTANT:
The Tensor Virtual Machine (TVM) is an open-source framework for building and deploying machine learning models. It was developed by Google and is designed to make it easier to train and deploy machine learning models on a variety of platforms, including cloud, edge, and embedded devices. TVM provides a high-level API that allows developers to write machine learning code in a simple and intuitive way, without worrying about the underlying hardware or software details. This makes it easier for developers to focus on building their models and deploying them in a variety of environments.
TVM supports a wide range of machine learning frameworks and libraries, including TensorFlow, PyTorch, and scikit-learn. It also provides tools for debugging and profiling models, as well as support for distributed training and deployment.
Overall, TVM isTerminated
Running the chat on a local machine (WSL2, 12 GB allocated + 4 GB swap) has not produced a crash yet, but once the system starts swapping, the chat gets slower and slower.
Later, I will migrate the Codespace to 4 cores and 16 GB to check whether the crash still occurs with the same prompt.
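To put rough numbers on the memory suspicion, here is a back-of-envelope sketch in Python. The LLaMA-2 7B shape (32 layers, hidden size 4096) is public, but the 4 bytes per cached value and the 2048-token cache length are my assumptions, not confirmed TinyChatEngine settings. The function name is made up for illustration, and the estimate ignores quantization scales, activations, and allocator overhead.

```python
GIB = 2**30

def estimate_memory_gib(n_params=7e9, weight_bits=4,
                        n_layers=32, hidden=4096,
                        cache_tokens=2048, kv_bytes=4):
    """Rough footprint (weights, KV cache) in GiB. Hypothetical helper.

    cache_tokens and kv_bytes are assumptions, not values read from
    the TinyChatEngine source.
    """
    # Quantized weights: one weight_bits-wide value per parameter.
    weights = n_params * weight_bits / 8
    # KV cache: one key and one value vector of size `hidden`
    # per layer and per cached token.
    kv_cache = 2 * n_layers * hidden * cache_tokens * kv_bytes
    return weights / GIB, kv_cache / GIB

weights_gib, kv_gib = estimate_memory_gib()
print(f"weights ~{weights_gib:.2f} GiB, KV cache ~{kv_gib:.2f} GiB")
```

Under these assumptions the weights take about 3.3 GiB and a fully allocated cache another 2 GiB, which would already be a large share of an 8 GB Codespace before the OS and other buffers are counted. That is consistent with a memory problem but does not prove it.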
Hi, I found that the program crashes at the first forward pass when the input context is long. I modified the code to print the number of input tokens and used some corpus text to find the crash border. I also noticed that the first chat turn adds some instruction text to the prompt, so I send "hello" first to skip it. When the number of input tokens is greater than 126, the program crashes. I checked the crash position: it happens at the first loop of https://github.com/mit-han-lab/TinyChatEngine/blob/de720b46327ee3b8cbb20a069799ff2e69908a13/llm/src/nn_modules/non_cuda/LLaMAGenerate.cc#L75C38-L75C38
Does anyone have an idea what causes this?
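For anyone who wants to reproduce the crash border on their own hardware, the search can be automated with a bisection instead of trying lengths by hand. This is only a sketch: `find_crash_border` and the `crashes` predicate are hypothetical names, and it assumes the crash is monotonic in prompt length (every length above the border crashes, every length at or below it does not). In practice `crashes(n)` would launch the chat binary with an n-token prompt and report whether it exited abnormally. Here a stand-in predicate is used so the snippet runs on its own.

```python
from typing import Callable

def find_crash_border(crashes: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest token count in (lo, hi] that crashes.

    Assumes crashes(lo) is False, crashes(hi) is True, and that the
    predicate flips from False to True exactly once in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid   # border is at mid or below
        else:
            lo = mid   # border is above mid
    return hi

# Stand-in predicate that mimics the behaviour reported above
# (crash when the number of input tokens is greater than 126).
border = find_crash_border(lambda n: n > 126, lo=1, hi=1024)
print(border)
```

If the border moves when the machine has more RAM, that points at memory; if it stays fixed at the same token count, a fixed-size buffer is the more likely cause.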
Here is the test log.