Open ctmackay opened 9 months ago
Hey there, I played around a bit with LLMs and Triton. You need to export the transformer model to ONNX
format. To do so you can use a tool like optimum-cli
(install it with pip, then run the export command).
You could use a command like this to export the model to ONNX:
optimum-cli export onnx -m meta-llama/Llama-2-7b-chat-hf --task text-generation --device cuda --cache_dir ${PWD}/work/cache --no-post-process ${OUT_FOLDER}
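If you prefer to do the export from Python instead of the CLI, the same thing can be done through the optimum library. A minimal sketch, assuming the same model id and a hypothetical output folder (you still need access to the gated meta-llama repo and the same amount of memory):

```python
# Export Llama 2 to ONNX via the optimum Python API instead of optimum-cli.
# Assumes access to the gated meta-llama repo and enough VRAM/RAM for the export.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
out_folder = "./llama2-7b-chat-onnx"  # hypothetical output folder

# export=True converts the checkpoint to ONNX while loading it
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Writes model.onnx (plus external weight files) and the tokenizer files to disk
model.save_pretrained(out_folder)
tokenizer.save_pretrained(out_folder)
```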
You will need a lot of resources, though, to export the model to ONNX format; in my case I needed around 30 GB of VRAM on the GPU and around 80 GB of RAM on the machine to export the 7B model.
Source: https://huggingface.co/docs/transformers/serialization
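To sanity-check the exported files before wiring them into the Triton Python backend, you can load them back with onnxruntime through optimum and run a quick generation. A rough sketch, assuming the hypothetical output folder from the export step above:

```python
# Quick local test of the exported ONNX model with onnxruntime (via optimum),
# before putting it behind the Triton Python backend.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

out_folder = "./llama2-7b-chat-onnx"  # hypothetical path from the export step

model = ORTModelForCausalLM.from_pretrained(out_folder)
tokenizer = AutoTokenizer.from_pretrained(out_folder)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```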
I'm running the Python backend of the Triton Inference Server. The server and client are running; however, the server cannot find the llamav2 model.
Is there a guide on how to convert the Llama model to ONNX, or is an ONNX file available somewhere?