Open ctmackay opened 9 months ago
Hey there, I played around a bit with LLMs and Triton. You need to export the transformer model to ONNX
format. To do so you can use a tool like optimum-cli
(install it with pip, then run the export command).
You could use a command like this to export the model to ONNX:
optimum-cli export onnx -m meta-llama/Llama-2-7b-chat-hf --task text-generation --device cuda --cache_dir ${PWD}/work/cache --no-post-process ${OUT_FOLDER}
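If you prefer to do the export from Python instead of the CLI, the same thing can be done through the optimum library. A minimal sketch, assuming the same model id and a hypothetical output folder (you still need access to the gated meta-llama repo and the same amount of memory):

```python
# Export Llama 2 to ONNX via the optimum Python API instead of optimum-cli.
# Assumes access to the gated meta-llama repo and enough VRAM/RAM for the export.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
out_folder = "./llama2-7b-chat-onnx"  # hypothetical output folder

# export=True converts the checkpoint to ONNX while loading it
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Writes model.onnx (plus external weight files) and the tokenizer files to disk
model.save_pretrained(out_folder)
tokenizer.save_pretrained(out_folder)
```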
You will need a lot of resources, though, to export the model to ONNX format; in my case I needed around 30 GB of VRAM on the GPU and around 80 GB of RAM on the machine to export the 7B model.
Source: https://huggingface.co/docs/transformers/serialization
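To sanity-check the exported files before wiring them into the Triton Python backend, you can load them back with onnxruntime through optimum and run a quick generation. A rough sketch, assuming the hypothetical output folder from the export step above:

```python
# Quick local test of the exported ONNX model with onnxruntime (via optimum),
# before putting it behind the Triton Python backend.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

out_folder = "./llama2-7b-chat-onnx"  # hypothetical path from the export step

model = ORTModelForCausalLM.from_pretrained(out_folder)
tokenizer = AutoTokenizer.from_pretrained(out_folder)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```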
I'm running the Python backend of the Triton Inference Server. The server and client are running; however, the server cannot find the llamav2 model.
Is there a guide on how to convert the Llama model to ONNX, or is an ONNX file available somewhere?