No quantization? Is it possibly just really slow?
OK, it does not hang. It's just slower than I expected it to be. That brings me back to whether we need to warn people about commands that will take a while. https://github.com/pytorch/torchchat/issues/558
We might consider compiling and/or quantizing to see whether those help.
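For reference, a rough sketch of what that could look like with the torchchat CLI. `--compile` and `--quantize` are real torchchat flags, but the exact quantization config string below is an assumption; check the torchchat README for the schemes supported on a given device:

```
# Sketch only: flag names are from torchchat, the quantization
# config is illustrative, not a recommended setting.

# Compile the model with torch.compile for faster generation
python3 torchchat.py chat llama3 --compile

# Quantize linear layers to int8 (groupsize value is illustrative)
python3 torchchat.py chat llama3 --quantize '{"linear:int8": {"groupsize": 256}}'
```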
```
(.venv) (base) mikekg@mikekg-mbp torchchat % # Llama 3 8B Instruct
python3 torchchat.py chat llama3
zsh: command not found: #
Using device=cpu Apple M1 Max
Loading model...
Time to load model: 10.23 seconds
Entering Chat Mode. Will continue chatting back and forth with the language model until the models max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no.
y
What is your system prompt?
You are a techer and you treat every interaction as a teachable moment, providing lots of unrequested extra info

User: what are the 7 continents
```
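Side note: the `zsh: command not found: #` line is harmless; it is just zsh rejecting the pasted `# Llama 3 8B Instruct` comment line, since interactive zsh does not treat `#` as a comment by default. If you paste commands with comments often:

```
# Allow "#" comments at an interactive zsh prompt (off by default)
setopt interactive_comments
```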