Closed — atambay37 closed this issue 3 months ago
To speed up inference, we've decided to use llama.cpp with the llama-cpp-python Python binding. However, this requires OLMo to be in GGUF file format and quantized to 4 bits. Some of the base models are already available in GGUF format at https://huggingface.co/collections/nopperl/olmo-gguf-66211a0071b6c3d66303fcf1.
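For context, a rough back-of-the-envelope estimate of why 4-bit quantization helps, assuming 7B parameters and ignoring quantization block overhead and non-weight buffers (the numbers are illustrative, not measured):

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-memory weight size in GB (decimal), ignoring
    quantization block overhead and non-weight buffers."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = approx_model_size_gb(7e9, 16)  # 14.0 GB
q4 = approx_model_size_gb(7e9, 4)     # 3.5 GB
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB, ratio: {fp16 / q4:.0f}x")
```

The ~4x smaller weights are what make the model fit in modest RAM and stream through memory faster, which is where most of the inference speedup comes from.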
@anantmittal is exploring whether we can convert and quantize OLMo-7B-Instruct, since that model has been tuned better for our use case.
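A sketch of the conversion and quantization steps using a local llama.cpp checkout. Script and binary names vary across llama.cpp versions (newer builds ship `llama-quantize` instead of `quantize`), and the checkpoint path and output filenames here are placeholders:

```shell
# Clone llama.cpp and install the conversion script's Python dependencies
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
make

# Convert the HF checkpoint to GGUF (fp16), then quantize to 4-bit (Q4_K_M)
python convert-hf-to-gguf.py /path/to/OLMo-7B-Instruct \
    --outfile olmo-7b-instruct-f16.gguf
./quantize olmo-7b-instruct-f16.gguf olmo-7b-instruct-q4_k_m.gguf Q4_K_M
```

Q4_K_M is a common default 4-bit variant; other quantization types trade size against quality, so it's worth comparing a couple on our eval questions.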
The Google Colab notebook listed under https://github.com/uw-ssec/tutorials/issues/20 has one writeup of the code for this step. The final version will look a bit different, as we will use a different / faster model.
Added useful links here: https://github.com/uw-ssec/tutorials/discussions/52
The OLMo team hasn't released checkpoints for the older instruct models, but OLMo 1.7 Instruct is being trained and its checkpoints will be released: https://github.com/allenai/OLMo/issues/553
Ask it several questions to demonstrate its capabilities (e.g., ability to answer fact-based questions, when and where OLMo hallucinates, etc.)
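A minimal sketch of loading the quantized GGUF with llama-cpp-python and asking it a question. The model path is a placeholder, and the chat template is an assumption (OLMo-Instruct is reported to use a `<|user|>`/`<|assistant|>` template, but verify against the model card):

```python
def format_prompt(question: str) -> str:
    """Wrap a question in the assumed OLMo-Instruct chat template."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def ask(question: str, model_path: str = "olmo-7b-instruct-q4_k_m.gguf") -> str:
    """Answer one question with the quantized model (requires
    llama-cpp-python and a local GGUF file; the path is a placeholder)."""
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    out = llm(
        format_prompt(question),
        max_tokens=128,
        stop=["<|user|>"],  # stop before the model starts a new turn
    )
    return out["choices"][0]["text"].strip()
```

For the hallucination probes, it would be useful to log both fact-based questions the model should get right and out-of-distribution questions where it is likely to confabulate, so we can compare quantized vs. unquantized behavior.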