KasperGroesLudvigsen opened 3 weeks ago
Also, it would be really cool to measure the energy consumption per token while we run inference.
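For reference, a minimal sketch of what that measurement could look like, assuming an NVIDIA GPU and the `pynvml` bindings (nvidia-ml-py); the `run_inference` callable is a placeholder for whatever generation code we end up using:

```python
from typing import Callable

import pynvml


def joules_per_token(run_inference: Callable[[], int], gpu_index: int = 0) -> float:
    """Run `run_inference` (which should return the number of generated tokens)
    and report the GPU energy it consumed per token, in joules."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # NVML exposes a cumulative energy counter (millijoules since driver load),
    # so the difference before/after the call is the energy spent during it.
    start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    num_tokens = run_inference()
    end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    pynvml.nvmlShutdown()
    return (end_mj - start_mj) / 1000.0 / max(num_tokens, 1)
```

Note that this attributes everything the GPU did during the call to the generated tokens, so it is a rough per-request number rather than a controlled measurement.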
Or maybe it's better to use offline batching with vLLM: https://docs.vllm.ai/en/v0.6.0/getting_started/quickstart.html
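Roughly what the offline-batching path from that quickstart looks like; the model id below is just a placeholder, not a decision:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of Denmark is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Placeholder model id; any Hugging Face model that vLLM supports would work here.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```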
Very nice work, Meshach! I guess the next step would be to set up an LLM API in a Docker container on the GPU server (e.g. with vLLM) so that we can substitute your call to "gpt-3.5-turbo-instruct" with a local model. What do you think?
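If we go that route, the client-side swap should be small, since vLLM can serve an OpenAI-compatible API. A sketch, assuming the container exposes that API on port 8000; the host name and model id are placeholders:

```python
from openai import OpenAI

# Point the OpenAI client at the vLLM server instead of api.openai.com.
client = OpenAI(
    base_url="http://<gpu-server>:8000/v1",  # placeholder host
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model id
    prompt="Write one sentence about GPU energy use.",
    max_tokens=64,
)
print(completion.choices[0].text)
```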