Closed rmusser01 closed 1 month ago
Seems llama.cpp-python ain't it, chief.
Let's look at bundling/downloading llama.cpp compiled for the host platform (and for whether CUDA/ROCm is available or not). That way, we can bundle it as a package, allow for updates, and use Llama to download HF models.
That or llamafile: perhaps check for the existence of CUDA/ROCm drivers/hardware, and if found, download the appropriate llama.cpp release; otherwise use llamafile + MS Phi3 128k as a local model.
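A minimal sketch of the backend check described above. This just looks for the vendor tools on PATH; the function name and the exact tools checked are assumptions, not anything from this repo:

```python
import shutil

def detect_gpu_backend() -> str:
    """Return 'cuda', 'rocm', or 'cpu' based on which vendor tools are on PATH.

    Hypothetical helper: presence of nvidia-smi / rocminfo is a cheap proxy
    for installed drivers, not a guarantee the hardware is usable.
    """
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocminfo") or shutil.which("rocm-smi"):
        return "rocm"
    return "cpu"
```

The result could then pick which llama.cpp release asset to download, falling back to llamafile on "cpu".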
llamafile + https://huggingface.co/cognitivetech/samantha-mistral-instruct-7b_bulleted-notes_GGUF
Seems to be the best (slowest/easiest) method...
Llamafile implementation is in.
Will download 1 of 2 models, and then use llamafile to run them in system RAM if the '--local_llm' argument is passed. Checks whether the files already exist before downloading, and does SHA-256 verification of the downloaded files to ensure integrity (i.e., that they aren't incomplete).
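The existence-check + SHA-256 verification step could look something like this (function names, chunk size, and the urllib download are my assumptions, not the actual implementation):

```python
import hashlib
import urllib.request
from pathlib import Path

def verify_sha256(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks (models are large) and compare digests."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected.lower()

def download_if_missing(path: Path, expected_sha256: str, url: str) -> None:
    # Skip the download when the file already exists and its hash checks out;
    # a partial download from a previous run fails the hash and is re-fetched.
    if path.exists() and verify_sha256(path, expected_sha256):
        return
    urllib.request.urlretrieve(url, path)
    if not verify_sha256(path, expected_sha256):
        raise ValueError(f"SHA-256 mismatch for {path}")
```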
Need to test to ensure the API works as expected.
Need to add options to:
Need to add a method of killing llamafile when the script exits, so as not to leave it running.
Tested on Windows and confirmed working with 2 of 3 models. Phi3 for some reason just goes nutso when used as part of the script. Will continue tweaking it, but the other two selected models work great...
Use https://github.com/abetlen/llama-cpp-python
to download + run MS Phi3 128k Context model
when proper CLI args are passed.
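The "proper CLI args" gate could be as simple as an argparse flag; only the '--local_llm' name comes from this thread, everything else below is a placeholder sketch:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Description is a placeholder; only '--local_llm' is taken from the thread.
    parser = argparse.ArgumentParser(description="Run with an optional local LLM backend")
    parser.add_argument(
        "--local_llm",
        action="store_true",
        help="Download the model if missing, then run inference locally",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.local_llm:
        print("Starting local LLM...")  # placeholder for the download-and-run path
```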
Will allow individuals to need nothing besides the application (and some free space...) to perform inference without struggles.