marvik-ai / triton-llama2-adapter

MIT License

Where can I get the llamav2.onnx model file from? #1

Open ctmackay opened 9 months ago

ctmackay commented 9 months ago

I'm running the Python backend of the Triton Inference Server. The server and the client are both running. However, the server cannot find the llamav2 model.

I1206 19:08:51.768841 100 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I1206 19:12:21.760961 100 http_server.cc:3452] HTTP request: 2 /v2/models/llamav2/versions/1/infer
I1206 19:12:21.761003 100 model_lifecycle.cc:328] GetModel() 'llamav2' version 1
I1206 19:12:21.761021 100 http_server.cc:2988] [request id: ] Infer failed: Request for unknown model: 'llamav2' is not found

Is there a guide on how to convert the Llama model to ONNX, or somewhere an ONNX file is already available?

AbstractVersion commented 8 months ago

Hey there, I played around a bit with LLMs and Triton. You need to export the Transformers model to ONNX format. To do so you can use a tool like optimum-cli (install it with pip, configure it, and run it).

You could use a command like this to export the model to ONNX:

optimum-cli export onnx -m meta-llama/Llama-2-7b-chat-hf --task text-generation --device cuda --cache_dir ${PWD}/work/cache --no-post-process   ${OUT_FOLDER}
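If you'd rather do the export from Python instead of the CLI, the same thing can be done through optimum's onnxruntime integration. This is only a rough sketch based on the optimum documentation; the output directory is made up and not something this repo prescribes:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# export=True converts the Transformers checkpoint to ONNX on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# writes model.onnx (plus external weight files for a 7B model) and the tokenizer files
model.save_pretrained("./llama2-7b-chat-onnx")    # output path is illustrative
tokenizer.save_pretrained("./llama2-7b-chat-onnx")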

You will need a lot of resources, though, to export the model to ONNX format: in my case I needed around 30 GB of VRAM on the GPU and around 80 GB of RAM on the machine to export a 7B model.
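Once the export is done, Triton will only pick the model up if it sits in the model repository under a directory whose name matches what the client requests (llamav2 in the log above); otherwise you keep getting the "Request for unknown model" error. A rough layout for a Python-backend setup like this repo's would be (a sketch; the exact file names depend on what the export produces and on the adapter's model.py):

model_repository/
└── llamav2/
    ├── config.pbtxt        # declares name: "llamav2" and backend: "python"
    └── 1/
        ├── model.py        # the Python-backend script that loads and runs the exported model
        └── llamav2.onnx    # the exported ONNX file (plus any external weight files)

Then start the server pointing at that directory, e.g. tritonserver --model-repository=/path/to/model_repository.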

Source: https://huggingface.co/docs/transformers/serialization