Support remote inference on Triton + TensorRT or vLLM or TGI

stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).

https://crfm.stanford.edu/helm

Apache License 2.0


Support remote inference on Triton + TensorRT or vLLM or TGI #1997

Closed. percyliang closed this issue 4 months ago.

percyliang commented 7 months ago

The preferred way to run models is to stand up an inference server (e.g., Triton + TensorRT, vLLM, or TGI) locally and then call it from HELM as an API. This way, HELM benefits from the inference optimizations those servers already implement. We need to demonstrate a proof of concept and write docs for this.
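To make the "hit it as an API" idea concrete, here is a minimal sketch of a client for a locally running vLLM server. It assumes the server was started with vLLM's OpenAI-compatible entrypoint and is listening on `http://localhost:8000`; the function names and the model name are hypothetical and are not part of HELM. TGI and Triton expose different request formats, so this would need adapting for them.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16, temperature=0.0):
    """Build a POST request for an OpenAI-style /v1/completions endpoint.

    Assumption: the server speaks the OpenAI completions format, as vLLM's
    OpenAI-compatible server does. This helper is illustrative only.
    """
    url = base_url.rstrip("/") + "/v1/completions"
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, timeout=60.0, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


# Example (requires a running server; "my-model" is a placeholder name):
# print(complete("http://localhost:8000", "my-model", "Hello, my name is"))
```

Keeping request construction separate from the network call means the payload can be inspected or tested without a server running.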

yifanmai commented 7 months ago

Opened draft PR #1975 for vLLM.

yifanmai commented 4 months ago

The TGI part is duplicated by #1866. I don't know of any users asking for Triton currently, so I will deprioritize that.