benjamintli opened this issue 3 months ago
@benjamintli, thanks for reporting the issue! It's likely that the quantization is regressing the model quality; that can be expected for smaller models, whose weights carry denser information.
To verify, could you try removing `-X -qmode 8da4w --group_size 128 -d fp32` and see if the results are better? If so, try reducing the `group_size` from 128 to 64 or 32 and see if the quality improves.
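Concretely, something along these lines, keeping the rest of the README's export command unchanged (the checkpoint and params paths here are placeholders):

```bash
# First export without quantization, to rule the quantizer out
python -m examples.models.llama2.export_llama --checkpoint tinyllama.pth -p params.json -kv --use_sdpa_with_kv_cache

# If quality recovers, re-enable quantization with a smaller group size
python -m examples.models.llama2.export_llama --checkpoint tinyllama.pth -p params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 64 -d fp32
```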
I'm trying to use this README: https://github.com/pytorch/executorch/tree/main/examples/models/llama2 on other llama2-based models like TinyLlama 1.1B: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0.
Here's the params.json I wrote (based on the config):
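It was roughly the following (reconstructed here from TinyLlama's `config.json`: hidden_size 2048, 22 layers, 32 attention heads with 4 KV heads, vocab 32000; `multiple_of` is chosen so the llama FFN sizing works out to TinyLlama's intermediate_size of 5632, and the whole file may not match what I actually used byte-for-byte):

```json
{
  "dim": 2048,
  "multiple_of": 256,
  "n_heads": 32,
  "n_kv_heads": 4,
  "n_layers": 22,
  "norm_eps": 1e-05,
  "vocab_size": 32000
}
```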
I used the torchtune python script in the README to convert the safetensors in the huggingface repo to a state_dict:
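The conversion was along these lines (a sketch assuming the torchtune checkpointer shown in the ExecuTorch docs; the import path for `FullModelHFCheckpointer` has moved between torchtune versions, and directory names are placeholders):

```python
import torch
from torchtune.models import convert_weights
from torchtune.utils import FullModelHFCheckpointer  # torchtune.training in newer releases

# Load the Hugging Face safetensors checkpoint into torchtune's format
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="TinyLlama-1.1B-Chat-v1.0",
    checkpoint_files=["model.safetensors"],
    output_dir="TinyLlama-1.1B-Chat-v1.0",
    model_type="LLAMA2",
)
state_dict = checkpointer.load_checkpoint()["model"]

# Re-key to the Meta llama naming that export_llama expects, then save
state_dict = convert_weights.tune_to_meta(state_dict)
torch.save(state_dict, "tinyllama.pth")
```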
And here's the command I ran to create the model:
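It was essentially the README's export command (paths are placeholders; the quantization flags match what's quoted in the reply above):

```bash
python -m examples.models.llama2.export_llama \
    --checkpoint tinyllama.pth \
    -p params.json \
    -kv --use_sdpa_with_kv_cache \
    -X -qmode 8da4w --group_size 128 -d fp32
```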
When I try to run this on Android, the output it gives is incoherent. That's a bit strange: it's definitely not the right output, and the same model in GGUF format gives decently coherent responses.
Are the instructions in the README for converting llama 7B models "supposed" to be applicable to any llama2-architecture model? Does anyone know what's wrong with my setup here? And is there going to be a README/guide for converting Hugging Face-formatted LLMs into something runnable in ExecuTorch?