nod-ai / shark-ai

SHARK Inference Modeling and Serving
Apache License 2.0

Find a More General and Easier-to-Use Alternative for Compiling Models for the Shortfin LLM Server #402

Open · stbaione opened this issue 3 weeks ago

stbaione commented 3 weeks ago

Prior discussion in #373 and #284.

The export script in sharktank was built specifically for Llama 3.1 models and has some rough edges. It also requires users to chain together CLI commands: python -m sharktank.examples.export_paged_llm_v1 [--options], then iree-compile [--options].

That is cumbersome from a user perspective, and it forces CI runs to invoke the CLI commands via subprocess instead of using a programmatic in-memory alternative.
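
To make the pain point concrete, a CI job today has to shell out to the two tools in sequence, roughly like the sketch below (the paths and flags are illustrative, not the exact option set):

import subprocess

# Step 1: export the model to MLIR with the sharktank example script
# (paths and flags are illustrative).
subprocess.run(
    [
        "python", "-m", "sharktank.examples.export_paged_llm_v1",
        "--gguf-file=/path/to/model.gguf",
        "--output-mlir=model.mlir",
        "--output-config=config.json",
    ],
    check=True,
)

# Step 2: compile the MLIR to a .vmfb for the shortfin server
# (target backend is illustrative).
subprocess.run(
    [
        "iree-compile", "model.mlir",
        "--iree-hal-target-backends=rocm",
        "-o", "model.vmfb",
    ],
    check=True,
)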

We should find a more general and easier-to-use solution for generating MLIR for LLM models and compiling those models to .vmfb for the shortfin server.

Below is a starting point recommendation provided by @ScottTodd:

"Users shouldn't need to chain together python -m sharktank.examples. and iree-compile ... commands. We can aim for something like https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

(that's as minimal as it gets - we'll want to pass options like the compilation target though)"
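
For reference, the full offline batched inference example from the linked vLLM quickstart looks roughly like this (prompts and sampling values are illustrative):

from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings.
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# A single object handles model loading, compilation, and execution.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)

Something like that, with one object owning export/compile/run and options (e.g. the compilation target) passed as arguments, is the user experience to aim for here.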

stellaraccident commented 3 weeks ago

Hang tight for a bit. More tooling is coming that will make this all one command. Building it out for sdxl first.

https://github.com/iree-org/iree/pull/18630#pullrequestreview-2409072569

ScottTodd commented 3 weeks ago

Ah yes, I was just going to connect those dots too.

For SDXL there are multiple submodels (VAE + UNet + CLIP), so having the build system manage all of them is especially helpful. Ideally we can standardize on a similar set of APIs for llama, SDXL, and future supported models.

stbaione commented 3 weeks ago

Closing, as something is already in the works for this.

ScottTodd commented 3 weeks ago

Well we still need code written. Fine to keep this as a tracking issue, blocked on the work happening for SDXL.

renxida commented 2 weeks ago

Looks like iree.build is merged!
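
For anyone following along: iree.build lets a build pipeline be declared as a decorated Python entrypoint and run with a single command. A rough sketch of the shape, based on the examples in that PR (action names and signatures may have changed since, and fetching a prebuilt MLIR file here stands in for the real llama export step):

# Rough sketch only; based on the examples in iree-org/iree#18630.
from iree.build import compile, entrypoint, fetch_http, iree_build_main

@entrypoint(description="Compile a model for the shortfin LLM server")
def llm():
    # Placeholder input: in practice this would be the MLIR produced by
    # the sharktank export, not a downloaded file.
    fetch_http(
        name="model.mlir",
        url="https://example.com/model.mlir",
    )
    return compile(
        name="model",
        source="model.mlir",
    )

if __name__ == "__main__":
    iree_build_main()

Running such a script would then drive fetch/export and compilation in one step; the flags for selecting compile targets and output directories are covered in the PR.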