Hello, and thanks for this excellent project!

I am currently using the Llama3-8B-1.58-100B-tokens quantized model (ggml-model-i2_s.gguf) from the BitNet repository. The model performs well during inference, but I am having difficulty loading the GGUF file directly in my Python chatbot pipeline, which needs to interact with CSV files.
I have tried using llama.cpp and the transformers library, but both approaches resulted in compatibility issues due to the GGUF file format.
What I’ve Tried:
Tested inference with BitNet, which worked as expected.
Attempted to load the GGUF file using llama.cpp and transformers, but encountered incompatibilities (a sketch of both attempts follows this list).
Searched through the documentation and available issues but couldn’t find a solution or official guidance for loading GGUF models in custom Python pipelines.
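For reference, here is a minimal sketch of the two loading attempts. The model paths are placeholders for my local files; I used the llama-cpp-python bindings for the llama.cpp attempt and the gguf_file argument of transformers' from_pretrained for the second:

```python
# Minimal sketch of both attempts; paths are placeholders for my local files.

# Attempt 1: llama.cpp via the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(model_path="models/ggml-model-i2_s.gguf")  # fails with a GGUF compatibility error

# Attempt 2: transformers' GGUF loader (dequantizes into a regular torch model)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "models/Llama3-8B-1.58-100B-tokens", gguf_file="ggml-model-i2_s.gguf"
)
model = AutoModelForCausalLM.from_pretrained(
    "models/Llama3-8B-1.58-100B-tokens", gguf_file="ggml-model-i2_s.gguf"
)  # also fails with a GGUF compatibility error
```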
Request: Could you please provide guidance on how to load the GGUF model directly in Python? Alternatively, is there any internal BitNet function or script that supports GGUF model loading for integration into Python-based chatbot pipelines?
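For now, the only workaround I can see is shelling out to the repository's run_inference.py script from my pipeline, roughly as sketched below (the wrapper function is my own, and I am assuming the -m/-p/-n flags from the README), but an in-process loader would be much cleaner:

```python
import subprocess

def bitnet_generate(prompt: str, model_path: str, n_predict: int = 128) -> str:
    """Stopgap: call BitNet's run_inference.py as a subprocess (not a real loader)."""
    result = subprocess.run(
        [
            "python", "run_inference.py",
            "-m", model_path,   # path to ggml-model-i2_s.gguf
            "-p", prompt,
            "-n", str(n_predict),
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Hypothetical usage inside the chatbot loop:
# reply = bitnet_generate("Summarize this CSV row: ...", "models/ggml-model-i2_s.gguf")
```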
Thank you for your help, and I appreciate any advice or guidance you can provide!