(Major) Changes
Added an option for probabilistic (sampled) decoding (currently only temperature = 1 is set, but further parameters can now easily be added); see the sketch below.
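A minimal sketch of what temperature-based sampling looks like with Hugging Face transformers, assuming generation goes through `model.generate`; the model name and the parameter plumbing here are illustrative assumptions, not the repo's actual code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the checkpoint actually used in this repo is an assumption.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # switch from greedy to probabilistic decoding
    temperature=1.0,   # the only value currently set; top_p, top_k, etc. could be added later
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```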
Minor Changes
Streamlined generate/ (removed pipeline.py and integrated its functions into run_pipeline.py)
Things to consider:
Quantized models still seem heavy to run inference with; further testing is needed on how hardware, batch size, bit size, etc. affect this. For bit size, we are currently running the 4-bit version (the main revision), but TheBloke provides other compressed versions. We should also look into alternatives, e.g. Mixtral-8x7B or vLLM (both suggested by Kenneth); a loading sketch follows below.
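For reference, a hedged sketch of loading a 4-bit GPTQ checkpoint from TheBloke with transformers (requires `optimum` and `auto-gptq` installed); the repo id is an illustrative example, not necessarily the model used here, with a commented vLLM alternative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative TheBloke checkpoint; the actual model used in this repo is an assumption.
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision="main",            # 4-bit branch; other bit sizes live on other revisions
    device_map="auto",          # let accelerate place layers across the available hardware
    torch_dtype=torch.float16,
)

# Alternative with vLLM (one of the suggestions above); also a sketch:
# from vllm import LLM, SamplingParams
# llm = LLM(model=repo_id, quantization="gptq")
# outputs = llm.generate(["prompt"], SamplingParams(temperature=1.0, max_tokens=50))
```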