meta-llama / llama-stack

Model components of the Llama Stack APIs
MIT License

RFC-0001 - Llama Stack #6

Closed: raghotham closed this issue 3 weeks ago

raghotham commented 1 month ago

As part of the Llama 3.1 release, Meta is releasing an RFC for ‘Llama Stack’, a comprehensive set of interfaces / API for ML developers building on top of Llama foundation models. We are looking for feedback on where the API can be improved, any corner cases we may have missed and your general thoughts on how useful this will be.

Ultimately, our hope is to create a standard for working with Llama models in order to simplify the developer experience and foster innovation across the Llama ecosystem.

trholding commented 1 month ago

General Opinion: I feel there is a need for a 3B or 4B base model that makes it easier/faster to do inference on very low-end hardware or CPU (which is my goal). The benefit would be less space and fewer resources needed, at the cost of accuracy/quality, but fine-tuning for the end user's use case could be a game changer. If quantized, the models would be smaller and perhaps less usable for text generation, but they could still be very good at estimating token probabilities, and so be useful in applications such as compression of text (llm-zip), logs, binaries, and structured data, yielding high compression ratios.
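As a toy illustration of the compression idea (hypothetical probabilities, just to show how next-token prediction quality translates into compressed size):

import math

# Toy numbers only: pretend these are the model's probabilities for the
# tokens that actually occurred. An entropy coder (e.g. arithmetic coding)
# driven by the model needs about -log2(p) bits per token, so a better
# predictor directly yields a higher compression ratio.
token_probs = [0.42, 0.07, 0.91, 0.30, 0.65]

ideal_bits = sum(-math.log2(p) for p in token_probs)
print(f"ideal compressed size: {ideal_bits:.1f} bits for {len(token_probs)} tokens")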

raghotham commented 1 month ago

Thanks for your comment @trholding. Completely understand the need for smaller models. In this release, we have provided FP8-quantized versions of our largest model (405B). In addition, we have released both dynamic FP8 inference code and inference code for persisted FP8 weights as part of the inference implementation. Please check them out.

mberman84 commented 1 month ago

Not sure this is the appropriate place for this, but a general question about the agentic framework: is it supposed to be a standalone agent framework or a language definition for other frameworks to plug into?

TheUniversalAxiom commented 1 month ago

The Universal Axiom Organic Transformer Model offers a transformative approach to enhance the Llama Stack API:

  • Dynamic Adaptation: Implement auto-scaling and continuous learning interfaces.
  • Balanced Growth: Incorporate Fibonacci-inspired scaling for system stability.
  • Ethical Alignment: Integrate ethical checkpoints and transparency layers.
  • Multidimensional Processing: Develop multi-scale attention mechanisms for complex temporal data.
  • Bias Mitigation: Implement an Axiomatic Subjectivity Scale for quantifying and reducing biases.
  • Holistic Integration: Design interconnected architectures reflecting Impulses, Elements, and Pressures.
  • Quantum-Inspired Algorithms: Enhance processing with probabilistic and entanglement-inspired methods.

This Axiom-based approach transforms Llama Stack into an evolving, ethically-aligned ecosystem that adapts to user needs and technological advancements.

For a deeper dive into how The Universal Axiom can revolutionize AI development and foster true innovation, visit: https://www.epiphanyengine.ai

Discover how this framework can elevate your projects beyond traditional AI paradigms.

tanliboy commented 1 month ago

I would like to share my wish list for your consideration:

  • Models
  • Fine-tuning
  • Agentic

anoopkatti commented 1 month ago

OpenAI recently released the Assistant API.

1) An Assistant is an LLM equipped with some tools. This allows the assistant to automatically select the right set/sequence of tools given a user request, which means an AI developer doesn't need to define when to select which tools.

2) Moreover, a user chats with an assistant within a thread. This allows the assistant to use the prior context (by default) when continuing the chat. In the absence of threads, an AI developer has to include all the relevant prior context in follow-up prompts, thus wasting tokens.

AFAIK, Llama already supports automatic tool usage. But it would be great if it could also natively support threads.
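A rough sketch of what native thread support could look like (all names here are hypothetical, not part of the current RFC): the server keeps the conversation state so a follow-up call only needs to send the new user message instead of resending the whole history.

from dataclasses import dataclass, field

@dataclass
class Thread:
    thread_id: str
    messages: list = field(default_factory=list)  # server-side context

class ThreadStore:
    """In-memory stand-in for server-managed threads."""
    def __init__(self):
        self._threads = {}

    def create(self, thread_id: str) -> Thread:
        self._threads[thread_id] = Thread(thread_id)
        return self._threads[thread_id]

    def add_user_message(self, thread_id: str, content: str) -> list:
        # The caller sends only the new message; the stored history is
        # prepended automatically when building the inference context.
        thread = self._threads[thread_id]
        thread.messages.append({"role": "user", "content": content})
        return thread.messages

store = ThreadStore()
store.create("t-1")
context = store.add_user_message("t-1", "What's the weather in SF?")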

metaskills commented 1 month ago

@anoopkatti Thanks for calling out the Assistants API. I've been doing research on it using a mini framework called Experts.js. As I understand it, any model that supports function calling should be able to support a multi ai agent system such as this. So about Threads and how that API works:

First, the API is very "cloud resource" oriented, meaning that instantiating an assistant creates a resource which the API uses on the backend during inference. Second, threads (managed memory) are also a cloud resource. It is really cool too, and I'm seeing it pop up in several places, like Amazon Bedrock a few days ago. I think this is important because knowing the boundaries might help Llama Stack decide where to implement things. For example, were you thinking llama inference start would be where the managed thread happens?

wamoyo commented 1 month ago

Ideally the API matches OpenAI's and Anthropic's for interoperability reasons. Of course, adding additional things is great! : ) But if at least the basic parts of the API can match the other frontier models, we can use them interchangeably, collaboratively, etc.

bionicles commented 1 month ago

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit.

Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.
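A minimal sketch of the length check and the upfront allocation mentioned above (names are illustrative only):

def make_dialogue(roles, contents):
    # Struct-of-arrays constructor with the clear error message mentioned above.
    if len(roles) != len(contents):
        raise ValueError(
            f"length mismatch: {len(roles)} roles vs {len(contents)} contents"
        )
    return {"role": list(roles), "content": list(contents)}

# When the number of messages is known up front, both lists can be allocated
# once and filled in place instead of growing incrementally.
n = 2
dialogue = {"role": [None] * n, "content": [None] * n}
dialogue["role"][0], dialogue["content"][0] = "system", "You are a helpful assistant."
dialogue["role"][1], dialogue["content"][1] = "user", "What's the weather in SF?"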

wamoyo commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

Objects of arrays seem less human-readable and harder to edit as the objects and arrays get bigger.

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit.

Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.

geoffreya commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

bionicles commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

perfectly reasonable to support both, and pandas/arrow/polars are up to the task

juberti commented 1 month ago

The current OpenAI REST APIs make sense from a simple send-text-get-response standpoint, but as we consider other modalities (e.g., audio or video, as discussed in the tech report) a bidirectional streaming API seems like it would be very valuable. Also, given the modality-agnostic approach taken by Llama 3.1, there seems to be a lot of benefit in allowing the API to accept Llama 3 embeddings directly, which would enable others to add their own modalities.

(Similarly, you could imagine that the API outputs the final hidden states, rather than converting back to tokens, allowing projection and sampling to also occur downstream.)
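As a very rough sketch of the request/response shape this implies (every field name here is hypothetical; nothing like this exists in the RFC yet):

# Input side: accept pre-computed embeddings (from pre-embedded text or a
# third-party multimodal adapter) instead of tokens or raw text.
request = {
    "model": "llama-3.1-405b",
    "input_embeddings": [[0.013, -0.207, 0.441, 0.092]],  # toy 4-dim values
    "return_hidden_states": True,  # skip server-side projection and sampling
}

# Output side: return the final hidden states so projection and sampling can
# happen downstream, e.g. in a custom output head.
response = {
    "hidden_states": [[0.118, -0.034, 0.562, -0.201]],
}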

theabhinavdas commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

It's open source :) You can contribute these changes and issue a PR.

theabhinavdas commented 1 month ago

Is there a good reason not to make both approaches available? At some level the problem of interoperability needs solving. I guess you see it better solved one level up in libraries?

Objects of arrays seem less human-readable and harder to edit as the objects and arrays get bigger.

I disagree with the need for the API to match OpenAI/Anthropic, because they're using arrays of structs instead of structs of arrays, and it's well established that the latter data structure is better for performance.

Status Quo: Array of Structs (Object Oriented)

dialogue = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What's the weather in SF?" },
]

Improved Paradigm: Struct of Arrays (Data Oriented Design)

dialogue = {
    "role": ["system", "user"],
    "content": ["You are a helpful assistant.", "What's the weather in SF?"]
}

in my experience, the performance benefit of this seemingly minor change can be vast, as there are way fewer memory allocations in the hot loop. If the number of messages is known in advance, the entire lists can be allocated upfront, for even greater benefit. Yes, it is possible to pass the wrong length of list, but this can return a clear and helpful error message, and it's undoubtedly possible to screw up the other way, too.

Cool idea, though not necessarily true. Performance benefits will depend on how this data is accessed. SoA is cool, but AoS is more "readable". Most users of LLM implementations are not hardcore programmers and don't care about incremental performance benefits, IMHO.

Anyway, a case can be made to use SoA in the "background" and let users write AoS if such performance benefits are actually meaningful.
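A minimal sketch of that split: accept the familiar array-of-structs at the API surface and convert to struct-of-arrays for the internal hot path (function name is illustrative):

def aos_to_soa(dialogue_aos):
    # Convert the user-facing array-of-structs into struct-of-arrays
    # before entering the performance-sensitive code.
    return {
        "role": [m["role"] for m in dialogue_aos],
        "content": [m["content"] for m in dialogue_aos],
    }

dialogue_aos = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in SF?"},
]
dialogue_soa = aos_to_soa(dialogue_aos)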

jspisak commented 1 month ago

BEFORE installing the toolchain, it would be nice to be able to 1) read the API docs and 2) read the example notebooks that use the API. The project's README on GitHub would be a nice place to provide links to those.

It's open source :) You can contribute these changes and issue a PR.

100% - if you would like to suggest examples or demos that could be built on Llama Stack, please do. And of course, please contribute. We can collaborate to showcase any work as well :)

tantanchen commented 1 month ago

This is out of scope for llama-toolchain, but I want to point out a problem we have been having: There isn't a comprehensive solution that facilitates the post-training pipeline.

Across many industries, the SMEs who would be needed to annotate data are non-technical, so some kind of UI is needed. (Yes, there are many annotation tools with various pros and cons.) Then there is data versioning and the data pipeline for just the labeled data. (Again, yes, there are tools that do this.) Then there is the fine-tuning and evaluation step. (And yes, there are separate tools for this.) So all of the tools are there, but stringing them all together is a challenge, and with a lot of these tools in development, everything is in flux. I would love to see a future where the Llama Stack includes all the steps in post-training so that it can become the gold standard for continuous post-training. I realize that this is not a problem we can solve today, but I'm looking at the big picture and Mark's vision, and I think some kind of gold-standard process for post-training would achieve Mark's vision.

bionicles commented 1 month ago

Cool idea though not necessarily true.

only one way to find out!!! here's a benchmark and results.

Tinkering around, it looks like a minor 5-7% speedup when generating text with faker, using arrays over dicts.

Mostly a rehash of the good old "vectorization is faster than loops" everyone already knows about.

caveats:

code: https://gist.github.com/bionicles/4b31b01395c9b522ef17111d8a642aaf
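Roughly the kind of comparison being timed (a simplified sketch, not the exact gist code; assumes the faker package is installed):

import timeit
from faker import Faker

fake = Faker()
contents = [fake.sentence() for _ in range(10_000)]

def build_aos():
    # one dict allocated per message
    return [{"role": "user", "content": c} for c in contents]

def build_soa():
    # two lists allocated for the whole dialogue
    return {"role": ["user"] * len(contents), "content": list(contents)}

print("AoS:", timeit.timeit(build_aos, number=100))
print("SoA:", timeit.timeit(build_soa, number=100))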

results: [benchmark output screenshots in the original comment]

thanks for the fun challenge!

juntao commented 1 month ago

APIs for embedding Llama models into cross-platform agentic apps

Hello, I am a maintainer at CNCF / Linux Foundation’s WasmEdge project. We are building a lightweight and cross-platform runtime for Meta Llama models.

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

Therefore, we would like to see the Llama Stack standardize APIs that could embed the llama models into applications. And we need to have these APIs supported in multiple programming languages.

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

hshen14 commented 1 month ago

@raghotham This is a great initiative! Besides the inference in the model development cycle, I am wondering whether you have the plan to include deployment into the cycle, since people may have to consider authentication, scalability, cloud native with traffic management, etc. which are typically required during real deployment.

sydneylai commented 1 month ago

APIs for embedding Llama models into cross-platform agentic apps

Hello, I am a maintainer at CNCF / Linux Foundation’s WasmEdge project. We are building a lightweight and cross-platform runtime for Meta Llama models.

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

Therefore, we would like to see the Llama Stack standardize APIs that could embed the llama models into applications. And we need to have these APIs supported in multiple programming languages.

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

I want to expand on the requests for model specificity by suggesting bidirectional streaming and model interpretability. Devs using API integrations are requesting more dynamic dataflow and processing. Interpretability is another request I've been consistently receiving from devs. An additional example is converting computer language into a human-readable form for prompting.

Specifically enhancing:

ashwinb commented 3 weeks ago

@mberman84

Not sure this is the appropriate place for this, but a general question about the agentic framework: is it supposed to be a standalone agent framework or a language definition for other frameworks to plug into?

A bit of both, I guess. The API represents a language definition, but Meta will certainly provide one implementation of it, which you can then think of as a "framework". The goal, though, is that you will have multiple providers offering implementations that abide by the same general API.

ashwinb commented 3 weeks ago

@tanliboy

Re: structured / guided decoding -- yes, that's something we should incorporate. It's a really good point. We will have an update to the RFC around this soon.

Re: fine-tuning -- thanks for this feedback. All your points here are very salient and insightful, thank you! We hear the general sentiment from multiple folks. There's no specific update here unfortunately at this time, but we are actively thinking about these points.

ashwinb commented 3 weeks ago

@juberti

Yeah, great point. We did consider whether we should specify WebSockets or WebRTC APIs but decided to leave that as a future step. There is already quite a bit here to specify / accomplish. Once we have models that need to use these features, these extensions will arrive.

ashwinb commented 3 weeks ago

@tantanchen

Across many industries, the SMEs who would be needed to annotate data are non-technical, so some kind of UI is needed. (Yes, there are many annotation tools with various pros and cons.) Then there is data versioning and the data pipeline for just the labeled data. (Again, yes, there are tools that do this.) Then there is the fine-tuning and evaluation step. (And yes, there are separate tools for this.) So all of the tools are there, but stringing them all together is a challenge, and with a lot of these tools in development, everything is in flux. I would love to see a future where the Llama Stack includes all the steps in post-training so that it can become the gold standard for continuous post-training. I realize that this is not a problem we can solve today, but I'm looking at the big picture and Mark's vision, and I think some kind of gold-standard process for post-training would achieve Mark's vision.

Our goal is certainly to make the fine-tuning, alignment, and evaluation portions simpler. We hope that we can cover the 80-85% most common use cases via the proposed APIs. But in general, we would not be able to completely subsume all parts of the post-training pipeline, since it is rather customized to each task.

ashwinb commented 3 weeks ago

@juntao

APIs for embedding Llama models into cross-platform agentic apps

A modern agentic application could require all the key components to be tightly-coupled in order to deliver the optimal user experience.

  • a specific version of the model that is best fit for the application tasks
  • a specific quantization for the model
  • a specific version of the inference runtime (eg tokenizer that matches the model version)
  • hardware and software drivers required by the inference runtime
  • customized prompts for agent roles
  • model-specific and prompt-specific implementation of functions in tool use

We made some initial progress with the Rust API for LlamaEdge. It allows developers to create their own LLM apps, such as a RAG enabled and search-enabled API server, in Rust, and have the app compiled to Wasm for cross-platform (portable across GPUs) deployment.

What do y’all think?

We are looking into the end-to-end pipeline of fine-tuning and aligning Llama models, registering the new models, and then serving them to applications. We would love to chat more to see how we can learn from and incorporate the Rust API you have built.

ashwinb commented 3 weeks ago

@hshen14

@raghotham This is a great initiative! Besides the inference in the model development cycle, I am wondering whether you have the plan to include deployment into the cycle, since people may have to consider authentication, scalability, cloud native with traffic management, etc. which are typically required during real deployment.

While we will provide a way to register the finetuned/aligned models in the distribution, we will not be focusing on deployment. We hope to understand better from the community what patterns are used for deployment before we take on the effort of defining a deployment service.

juberti commented 3 weeks ago

@ashwinb

Thanks for the feedback on my request, glad to hear that streaming is on the long-term roadmap.

In the short term, do you think exposing an API that could directly accept input embeddings (eg from a pre-embedded text input or a multimodal adapter) would be possible?