meta-llama / llama-stack

Model components of the Llama Stack APIs
MIT License

RFC-0001 - Llama Stack #6

Closed: raghotham closed this issue 3 weeks ago

raghotham commented 1 month ago

As part of the Llama 3.1 release, Meta is releasing an RFC for ‘Llama Stack’, a comprehensive set of interfaces / API for ML developers building on top of Llama foundation models. We are looking for feedback on where the API can be improved, any corner cases we may have missed and your general thoughts on how useful this will be.

Ultimately, our hope is to create a standard for working with Llama models in order to simplify the developer experience and foster innovation across the Llama ecosystem.

trholding commented 1 month ago

General Opinion: I feel there is a need for a 3B or 4B base model that makes it easier/faster to do inference on very low-end hardware or CPU (which is my goal). The benefit would be less space and fewer resources needed, at the cost of accuracy/quality, but fine-tuning for the end user's use case could be a game changer. If quantized, the models would be smaller and perhaps less usable for text generation, but they could still be very good at estimating token probabilities, and so be useful in applications such as compression of text (llm-zip), logs, binaries, and structured data, yielding high compression ratios.
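As a toy illustration of the compression idea (hypothetical probabilities, just to show how next-token prediction quality translates into compressed size):

import math

# Toy numbers only: pretend these are the model's probabilities for the
# tokens that actually occurred. An entropy coder (e.g. arithmetic coding)
# driven by the model needs about -log2(p) bits per token, so a better
# predictor directly yields a higher compression ratio.
token_probs = [0.42, 0.07, 0.91, 0.30, 0.65]

ideal_bits = sum(-math.log2(p) for p in token_probs)
print(f"ideal compressed size: {ideal_bits:.1f} bits for {len(token_probs)} tokens")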

raghotham commented 1 month ago

Thanks for your comment @trholding. Completely understand the need for smaller models. In this release, we have provided FP8-quantized versions of our largest model (405B). In addition, we have released both dynamic FP8 inference code and inference code for persisted FP8 weights as part of the inference implementation. Please check them out.

mberman84 commented 1 month ago

Not sure this is the appropriate place for this, but a general question about the agentic framework: is it supposed to be a standalone agent framework or a language definition for other frameworks to plug into?

TheUniversalAxiom commented 1 month ago

The Universal Axiom Organic Transformer Model offers a transformative approach to enhance the Llama Stack API:

  • Dynamic Adaptation: Implement auto-scaling and continuous learning interfaces.
  • Balanced Growth: Incorporate Fibonacci-inspired scaling for system stability.
  • Ethical Alignment: Integrate ethical checkpoints and transparency layers.
  • Multidimensional Processing: Develop multi-scale attention mechanisms for complex temporal data.
  • Bias Mitigation: Implement an Axiomatic Subjectivity Scale for quantifying and reducing biases.
  • Holistic Integration: Design interconnected architectures reflecting Impulses, Elements, and Pressures.
  • Quantum-Inspired Algorithms: Enhance processing with probabilistic and entanglement-inspired methods.

This Axiom-based approach transforms Llama Stack into an evolving, ethically-aligned ecosystem that adapts to user needs and technological advancements.

For a deeper dive into how The Universal Axiom can revolutionize AI development and foster true innovation, visit: https://www.epiphanyengine.ai

Discover how this framework can elevate your projects beyond traditional AI paradigms.

tanliboy commented 1 month ago

I would like to share my wish list for your consideration:

  • Models
  • Fine-tuning
  • Agentic

anoopkatti commented 1 month ago

OpenAI recently released the Assistant API.

1) An Assistant is an LLM equipped with some tools. This allows the assistant to automatically select the right set/sequence of tools given a user request, which means an AI developer doesn't need to define when to select which tools.

2) Moreover, a user chats with an assistant within a thread. This allows the assistant to use the prior context (by default) when continuing the chat. In the absence of threads, an AI developer has to include all the relevant prior context in follow-up prompts, thus wasting tokens.

AFAIK, Llama already supports automatic tool usage. But it would be great if it could also natively support threads.
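A rough sketch of what native thread support could look like (all names here are hypothetical, not part of the current RFC): the server keeps the conversation state so a follow-up call only needs to send the new user message instead of resending the whole history.

from dataclasses import dataclass, field

@dataclass
class Thread:
    thread_id: str
    messages: list = field(default_factory=list)  # server-side context

class ThreadStore:
    """In-memory stand-in for server-managed threads."""
    def __init__(self):
        self._threads = {}

    def create(self, thread_id: str) -> Thread:
        self._threads[thread_id] = Thread(thread_id)
        return self._threads[thread_id]

    def add_user_message(self, thread_id: str, content: str) -> list:
        # The caller sends only the new message; the stored history is
        # prepended automatically when building the inference context.
        thread = self._threads[thread_id]
        thread.messages.append({"role": "user", "content": content})
        return thread.messages

store = ThreadStore()
store.create("t-1")
context = store.add_user_message("t-1", "What's the weather in SF?")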

metaskills commented 1 month ago

@anoopkatti Thanks for calling out the Assistants API. I've been doing research on it using a mini framework called Experts.js. As I understand it, any model that supports function calling should be able to support a multi ai agent system such as this. So about Threads and how that API works:

First, the API is very "cloud resource" oriented, meaning that instantiating an assistant creates a resource which the API uses on the backend during inference. Second, threads (managed memory) are also a cloud resource. It is really cool too, and I'm seeing it pop up in several places, like Amazon Bedrock a few days ago. I think this is important because knowing the boundaries might help Llama Stack decide where to implement things. For example, were you thinking llama inference start would be where the managed thread happens?

wamoyo commented 1 month ago

Ideally the API matches OpenAI's and Anthropic's for interoperability reasons. Of course, adding additional things is great! : ) But if at least the basic parts of the API can match the other frontier models, we can use them interchangeably, collaboratively, etc.

bionicles commented 1 month ago

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit.

Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.
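A minimal sketch of the length check and the upfront allocation mentioned above (names are illustrative only):

def make_dialogue(roles, contents):
    # Struct-of-arrays constructor with the clear error message mentioned above.
    if len(roles) != len(contents):
        raise ValueError(
            f"length mismatch: {len(roles)} roles vs {len(contents)} contents"
        )
    return {"role": list(roles), "content": list(contents)}

# When the number of messages is known up front, both lists can be allocated
# once and filled in place instead of growing incrementally.
n = 2
dialogue = {"role": [None] * n, "content": [None] * n}
dialogue["role"][0], dialogue["content"][0] = "system", "You are a helpful assistant."
dialogue["role"][1], dialogue["content"][1] = "user", "What's the weather in SF?"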

wamoyo commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

Objects of arrays seem less human-readable and harder to edit as the objects and arrays get bigger.

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit.

Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.

geoffreya commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

bionicles commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

perfectly reasonable to support both, and pandas/arrow/polars are up to the task

juberti commented 1 month ago

The current OpenAI REST APIs make sense from a simple send-text-get-response standpoint, but as we consider other modalities (e.g., audio or video, as discussed in the tech report) a bidirectional streaming API seems like it would be very valuable. Also, given the modality-agnostic approach taken by Llama 3.1, there seems to be a lot of benefit in allowing the API to accept Llama 3 embeddings directly, which would enable others to add their own modalities.

(Similarly, you could imagine that the API outputs the final hidden states, rather than converting back to tokens, allowing projection and sampling to also occur downstream.)
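As a very rough sketch of the request/response shape this implies (every field name here is hypothetical; nothing like this exists in the RFC yet):

# Input side: accept pre-computed embeddings (from pre-embedded text or a
# third-party multimodal adapter) instead of tokens or raw text.
request = {
    "model": "llama-3.1-405b",
    "input_embeddings": [[0.013, -0.207, 0.441, 0.092]],  # toy 4-dim values
    "return_hidden_states": True,  # skip server-side projection and sampling
}

# Output side: return the final hidden states so projection and sampling can
# happen downstream, e.g. in a custom output head.
response = {
    "hidden_states": [[0.118, -0.034, 0.562, -0.201]],
}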

theabhinavdas commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

It's open source :) You can contribute these changes and issue a PR.

theabhinavdas commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

Objects of arrays seem less human-readable and harder to edit as the objects and arrays get bigger.

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit. Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.

Cool idea, though not necessarily true. Performance benefits will depend on how this data is accessed. SoA is cool, but AoS is more "readable". Most users of LLM implementations are not hardcore programmers and don't care about incremental performance benefits, IMHO.

Anyway, a case can be made to use SoA in the "background" and let users write AoS if such performance benefits are actually meaningful.
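A minimal sketch of that split: accept the familiar array-of-structs at the API surface and convert to struct-of-arrays for the internal hot path (function name is illustrative):

def aos_to_soa(dialogue_aos):
    # Convert the user-facing array-of-structs into struct-of-arrays
    # before entering the performance-sensitive code.
    return {
        "role": [m["role"] for m in dialogue_aos],
        "content": [m["content"] for m in dialogue_aos],
    }

dialogue_aos = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in SF?"},
]
dialogue_soa = aos_to_soa(dialogue_aos)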

jspisak commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

It's open source :) You can contribute these changes and issue a PR.

100% - if you would like to suggest examples or demos that could be built on Llama Stack, please do. And of course, please contribute. We can collaborate to showcase any work as well :)

tantanchen commented 1 month ago

This is out of scope for llama-toolchain, but I want to point out a problem we have been having: There isn't a comprehensive solution that facilitates the post-training pipeline.

Across many industries, the SMEs who would be needed to annotate data are non-technical, so some kind of UI is needed. (Yes, there are many annotation tools with various pros and cons.) Then there is data versioning and the data pipeline for just the labeled data. (Again, yes, there are tools that do this.) Then there is the fine-tuning and evaluation step. (And yes, there are separate tools for this.) So all of the tools are there, but stringing them all together is a challenge, and with a lot of these tools in development, everything is in flux. I would love to see a future where the Llama Stack includes all the steps in post-training so that it can become the gold standard for continuous post-training. I realize that this is not a problem we can solve today, but I'm looking at the big picture and Mark's vision, and I think some kind of gold-standard process for post-training would achieve Mark's vision.

bionicles commented 1 month ago

Cool idea though not necessarily true.

only one way to find out!!! here's a benchmark and results.

Tinkering around, it looks like a minor 5-7% speedup when generating text with faker, using arrays over dicts.

Mostly a rehash of the good old "vectorization is faster than loops" everyone already knows about.

caveats:

code: https://gist.github.com/bionicles/4b31b01395c9b522ef17111d8a642aaf
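Roughly the kind of comparison being timed (a simplified sketch, not the exact gist code; assumes the faker package is installed):

import timeit
from faker import Faker

fake = Faker()
contents = [fake.sentence() for _ in range(10_000)]

def build_aos():
    # one dict allocated per message
    return [{"role": "user", "content": c} for c in contents]

def build_soa():
    # two lists allocated for the whole dialogue
    return {"role": ["user"] * len(contents), "content": list(contents)}

print("AoS:", timeit.timeit(build_aos, number=100))
print("SoA:", timeit.timeit(build_soa, number=100))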

results: [benchmark output screenshots in the original comment]

thanks for the fun challenge!

juntao commented 1 month ago

APIs for embedding Llama models into cross-platform agentic apps

Hello, I am a maintainer at CNCF / Linux Foundation’s WasmEdge project. We are building a lightweight and cross-platform runtime for Meta Llama models.

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

Therefore, we would like to see the Llama Stack standardize APIs that could embed the llama models into applications. And we need to have these APIs supported in multiple programming languages.

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

hshen14 commented 1 month ago

@raghotham This is a great initiative! Besides the inference in the model development cycle, I am wondering whether you have the plan to include deployment into the cycle, since people may have to consider authentication, scalability, cloud native with traffic management, etc. which are typically required during real deployment.

sydneylai commented 1 month ago

APIs for embedding Llama models into cross-platform agentic apps

Hello, I am a maintainer at CNCF / Linux Foundation’s WasmEdge project. We are building a lightweight and cross-platform runtime for Meta Llama models.

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

Therefore, we would like to see the Llama Stack standardize APIs that could embed the llama models into applications. And we need to have these APIs supported in multiple programming languages.

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

I want to expand on the requests for model specificity by suggesting bidirectional streaming and model interpretability. Devs using API integrations are requesting more dynamic dataflow and processing. Interpretability is another request I've been consistently receiving from devs. An additional example is converting computer language into a human-readable form for prompting.

Specifically enhancing:

ashwinb commented 3 weeks ago

@mberman84

Not sure this is the appropriate place for this, but a general question about the agentic framework: is it supposed to be a standalone agent framework or a language definition for other frameworks to plug into?

A bit of both, I guess. The API represents a language definition, but Meta will certainly provide one implementation of it, which you can then think of as a "framework". The goal, though, is that you will have multiple providers offering implementations that abide by the same general API.

ashwinb commented 3 weeks ago

@tanliboy

Re: structured / guided decoding -- yes, that's something we should incorporate. It's a really good point. We will have an update to the RFC around this soon.

Re: fine-tuning -- thanks for this feedback. All your points here are very salient and insightful, thank you! We hear the general sentiment from multiple folks. There's no specific update here unfortunately at this time, but we are actively thinking about these points.

ashwinb commented 3 weeks ago

@juberti

Yeah, great point. We did consider whether we should specify WebSockets or WebRTC APIs but decided to leave that as a future step. There is already quite a bit here to specify / accomplish. Once we have models that need to use these features, these extensions will arrive.

ashwinb commented 3 weeks ago

@tantanchen

Across many industries, the SMEs who would be needed to annotate data are non-technical, so some kind of UI is needed. (Yes, there are many annotation tools with various pros and cons.) Then there is data versioning and the data pipeline for just the labeled data. (Again, yes, there are tools that do this.) Then there is the fine-tuning and evaluation step. (And yes, there are separate tools for this.) So all of the tools are there, but stringing them all together is a challenge, and with a lot of these tools in development, everything is in flux. I would love to see a future where the Llama Stack includes all the steps in post-training so that it can become the gold standard for continuous post-training. I realize that this is not a problem we can solve today, but I'm looking at the big picture and Mark's vision, and I think some kind of gold-standard process for post-training would achieve Mark's vision.

Our goal is certainly to make the fine-tuning, alignment, and evaluation portions simpler. We hope that we can cover the 80-85% most common use cases via the proposed APIs. But in general, we would not be able to completely subsume all parts of the post-training pipeline, since it is rather customized to each task.

ashwinb commented 3 weeks ago

@juntao

APIs for embedding Llama models into cross-platform agentic apps

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

We are looking into the end-to-end pipeline of fine-tuning and aligning Llama models, registering the new models, and then serving them to applications. We would love to chat more to see how we can learn from and incorporate the Rust API you have built.

ashwinb commented 3 weeks ago

@hshen14

@raghotham This is a great initiative! Besides the inference in the model development cycle, I am wondering whether you have the plan to include deployment into the cycle, since people may have to consider authentication, scalability, cloud native with traffic management, etc. which are typically required during real deployment.

While we will provide a way to register the finetuned/aligned models in the distribution, we will not be focusing on deployment. We hope to understand better from the community what patterns are used for deployment before we take on the effort of defining a deployment service.

juberti commented 3 weeks ago

@ashwinb

Thanks for the feedback on my request, glad to hear that streaming is on the long-term roadmap.

In the short term, do you think exposing an API that could directly accept input embeddings (eg from a pre-embedded text input or a multimodal adapter) would be possible?