nod-ai / SHARK

SHARK - High Performance Machine Learning Distribution
Apache License 2.0

Add StreamingLLM support to studio2 chat #2060

Closed · monorimet closed this 5 months ago

monorimet commented 5 months ago
dan-garvey commented 5 months ago

should we adapt to just use the turbine llm runner?

monorimet commented 5 months ago

> should we adapt to just use the turbine llm runner?

Sounds OK to me, but it will need some work to integrate with the UI. It might help with maintenance, since turbine seems to be the favorite for new features/dev workflows... we just need to have an option to run with SRT and its bleeding-edge flags, etc.
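
A minimal sketch of what such a runtime-selection option could look like. This is purely illustrative: the backend names, the `select_llm_runner` function, and the placeholder runner bodies are assumptions, not SHARK's or turbine's actual API.

```python
def select_llm_runner(backend: str):
    """Return a chat callable for the requested backend (illustrative only)."""

    def run_with_shark(prompt: str) -> str:
        # Placeholder for the existing SHARK runtime path.
        return f"[shark] {prompt}"

    def run_with_turbine(prompt: str) -> str:
        # Placeholder for the turbine llm runner (SRT + bleeding-edge flags).
        return f"[turbine] {prompt}"

    runners = {"shark": run_with_shark, "turbine": run_with_turbine}
    if backend not in runners:
        raise ValueError(f"unknown backend: {backend!r}")
    return runners[backend]
```

The UI would then only need to expose the backend choice, keeping the two runtime paths behind one callable interface.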

monorimet commented 5 months ago

Issue filed for the edge case preventing us from running the API test on a small model with externalized weights: https://github.com/openxla/iree/issues/16138

raikonenfnu commented 5 months ago

Is there a way we can specify the vmfb and safetensor path?

monorimet commented 5 months ago

It would be good to put in an option to set `self.vmfb_path` instead of renaming the target file. Will update in a few.

monorimet commented 5 months ago

Specifying vmfb path etc. will take a genuine CLI option interface or an option in the UI, both of which are significant changes and should come in a follow-up.
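
For the follow-up, the CLI side might look something like the argparse sketch below. The flag names are hypothetical; they are not existing SHARK options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI option interface for overriding artifact paths."""
    parser = argparse.ArgumentParser(
        description="studio2 chat options (illustrative sketch)")
    parser.add_argument(
        "--vmfb-path", default=None,
        help="use a precompiled .vmfb instead of the default filename")
    parser.add_argument(
        "--external-weights-path", default=None,
        help="path to externalized weights (.safetensors)")
    return parser
```

Defaults of `None` let the runner keep its current filename conventions unless the user explicitly overrides them.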

@dan-garvey I would prefer not to touch the finely balanced prompt handling until SDXL is finished, and we need this patch to clear the turbine CI.

monorimet commented 5 months ago

Please file an issue for the prompt handling; this is how we got where we were in 1.0.

https://github.com/nod-ai/SHARK/issues/2073 filed