opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[RFC] Conversations and Generative AI in OpenSearch #1150

Closed austintlee closed 3 months ago

austintlee commented 11 months ago

Introduction

The recent advances in Large Language Models (LLMs) have enabled developers to utilize natural language in their applications with better quality and ability. As ChatGPT has shown, these LLMs strongly enable use cases involving summarization and conversation. However, when prompting LLMs to answer fact-based questions (applications we call “conversational search”), we find that there are significant shortcomings for enterprise-grade applications.

First, the major LLMs are trained only on data that is exposed to the internet, and therefore do not have the context to answer questions on private data. Most enterprise data falls into this category. Second, the way in which LLMs answer questions based on their training data gives rise to “hallucinations” and false answers, which are not acceptable in applications for mission-critical use cases.

End-users love the ability to converse using colloquial language with an application to get answers to questions or find interesting search results, but require up-to-date information and accuracy. A solution to this problem is through Retrieval Augmented Generation (RAG), where an application sends an LLM a superset of correct information in response to a prompt, and the LLM is used to summarize and extract information from this set (instead of probabilistically determining an answer).

We believe OpenSearch could be a great platform for building conversational search applications, and aligns well with the RAG approach. It already offers semantic search capabilities using its vector database and k-NN plug-in, alongside enterprise-grade security and scalability. This is a great building block for the “source of truth” information retrieval component of RAG. However, it currently lacks the primitives and crisp APIs to easily enable the conversational element.

Although there are libraries that allow for building this functionality at the application layer (e.g. LangChain), we believe the best developer experience would be to enable this directly in OpenSearch. We consider the “G” in a RAG pipeline as LLM-based post-processing to enable direct question answering, summarization, and a conversational experience on top of OpenSearch semantic search. This enables end-users to interact with their data in OpenSearch in new ways. Furthermore, we believe developers may want to use different LLMs, and that the choice of model should be pluggable.

Through using plugins and search pipelines, we propose an architecture in this RFC to expose easily consumable APIs for conversational search, history, and storage. We segment it into a few components, including: 1/ search query rewriting using generative AI and conversational context, 2/ question answering and summarization of OpenSearch semantic search queries using generative AI, and 3/ a concept of “conversational memory” to easily store the state of conversations and add additional interactions. Conversational Memory will also support conversational applications that have multiple agents operating together, giving a single source of truth for conversation state.

Goals

1/ Developers can easily build conversational search applications (e.g. knowledge-base search, informational chatbot, etc.) using OpenSearch and their choice of generative AI model using well-defined REST APIs. Some of these applications will be an ongoing conversation, while others will be one-shot (and the history of interactions is not important).

2/ Developers can use OpenSearch to support multi-agent conversational architectures, which require a single “source of truth” for conversational history. Multi-agent architectures will have other agents besides that for semantic search with OpenSearch (e.g. an agent that queries the public internet). These developers need an easy API to manage conversational history, both in adding interactions to conversations and exploring history of those conversations.

3/ Developers can easily obtain OpenSearch (semantic) search results alongside the generative AI question answering, so they can show the source documents and enable the end user to explore the source material.

Non-Goals

1/ Building a general LLM application toolkit in OpenSearch. Our goal is just to enable conversational search and the related dependency of conversational memory.

2/ LLM hosting. LLMs take significant resources and should be operated outside of an OpenSearch cluster. We also hope to use the ML-Commons remote inference feature rather than implement our own connectors.

3/ A conversational search application platform. Our goal is to expose crisp APIs to make building applications that use conversational search easy, but not create the end application itself.

Proposed Architecture

![Aryn Conversation Plugins v2](https://github.com/opensearch-project/ml-commons/assets/1363802/a171b6aa-76ce-4c17-b874-6eeb244d6b21)

### Conversational Memory API (Chat History)

Conversational memory is the storage for conversations, which are an ordered list of interactions. Conversational memory makes it easy to add new interactions to a conversation or explore previous interactions. For example, you would need conversational memory to write a chatbot, since it takes the previous interactions in a conversation as part of the context for generating a future response. At a high level, this mostly resembles a generic read/write store, and we will use an OpenSearch index for it. However, the interesting nuance is in the data itself, which we will describe next.

A conversation is represented as a list of interactions, ordered chronologically. Each conversation will also include some metadata, like the start time and the number of interactions. The basic elements of an interaction are an input and a response, representing the human input to an AI agent and that agent’s response. We’ll also include any additional prompting that was used in the interaction, the agent that was used in this interaction, and possible arbitrary metadata that the agent may want to include. For example, a conversational search agent may include the actual search results as metadata for a user search query (which is an interaction).

Each `ConversationMetadata` and `Interaction` will have access controls linked to the specific user that creates them. Only Alice can add to and read from conversations that Alice owns. The main rationale for this is that Alice’s conversation will potentially include information from all documents Alice has access to, so her conversations’ access controls are maximally the intersection of Alice’s access rights. We plan to leverage OpenSearch’s existing access control mechanisms for this.

The plan is to maintain 2 indices - 1 for `ConversationMetadata` and 1 for `Interaction`.

```
structure ConversationMetadata {
    conversationId: ConversationId
    numInteractions: Integer
    createTime: Timestamp
    lastInteractionTime: Timestamp
    name: String
}

structure Interaction {
    conversationId: ConversationId
    interactionId: InteractionId
    input: String
    prompt: String
    response: String
    agent: String
    time: Timestamp
    attributes: InteractionAttributes
}
```
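To make the storage model concrete, the interaction index might use a mapping along the following lines. This is a sketch only: the index name and field types are illustrative assumptions rather than part of the proposal, and the `ConversationMetadata` index would be analogous.

```
# Sketch only - index name and field types are illustrative assumptions.
PUT /.conversational-memory-interactions
{
  "mappings": {
    "properties": {
      "conversation_id": { "type": "keyword" },
      "interaction_id":  { "type": "keyword" },
      "input":           { "type": "text" },
      "prompt":          { "type": "text" },
      "response":        { "type": "text" },
      "agent":           { "type": "keyword" },
      "time":            { "type": "date" },
      "attributes":      { "type": "object", "enabled": false }
    }
  }
}
```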
**API**

The operations for conversational memory are similar to the usual CRUD operations for a datastore. `CreateInteraction` will update the appropriate `ConversationMetadata` to have a correct `lastInteractionTime` and `numInteractions`.

```
/// Creates a new conversation and returns its id
operation CreateConversation {
    input: CreateConversationInput
    output: CreateConversationOutput
}

@input
structure CreateConversationInput {
    name: String
}

@output
structure CreateConversationOutput {
    conversationId: ConversationId
}

/// Returns the list of all conversations
operation GetConversations {
    input: GetConversationsInput
    output: GetConversationsOutput
}

@input
structure GetConversationsInput {
    nextToken: String
    maxResults: Integer
}

@output
structure GetConversationsOutput {
    conversations: List[ConversationMetadata]
    nextToken: String
}

/// Adds an interaction to a conversation and returns its id
operation CreateInteraction {
    input: CreateInteractionInput
    output: CreateInteractionOutput
}

@input
structure CreateInteractionInput {
    @required
    @httpLabel
    conversationId: ConversationId
    input: String
    prompt: String
    response: String
    agent: String
    attributes: InteractionAttributes
}

@output
structure CreateInteractionOutput {
    interactionId: InteractionId
}

/// Returns the list of interactions associated with a conversation
operation GetInteractions {
    input: GetInteractionsInput
    output: GetInteractionsOutput
}

@input
structure GetInteractionsInput {
    @required
    @httpLabel
    conversationId: ConversationId
    nextToken: String
    maxResults: Integer
}

@output
structure GetInteractionsOutput {
    metadata: ConversationMetadata
    interactions: List[Interaction]
    nextToken: String
}

operation DeleteConversation {
    input: DeleteConversationInput
    output: DeleteConversationOutput
}

@input
structure DeleteConversationInput {
    @required
    @httpLabel
    conversationId: ConversationId
}

@output
structure DeleteConversationOutput {
    success: Boolean
}
```

We do not propose an update API for conversation metadata; we treat it as immutable. We believe that users would prefer to create a new conversation rather than update parameters on an existing one.

### Search Pipeline extension

The conversational search path essentially consists of an OpenSearch query with some pre- and post-processing. Search Pipelines, introduced in 2.8, are a tool for pre- and post-processing in the query path, so we have chosen to use that mechanism to implement conversational search. We have chosen to implement the question answering component of RAG in the form of query result rewrites. We are introducing a new response processor that sends the top search results, and optionally some previous conversation history, to the LLM to generate a response in the conversation. We are also introducing a new response processor that iterates over search hits and interacts with an LLM to produce an answer, with a score, for each result. Finally, we are introducing a request processor to rephrase the user’s query, taking into account the conversation history. We will rely on the remote inference feature proposed in https://github.com/opensearch-project/ml-commons/issues/882 for answer generation.

Based on different patterns we have seen with applications, we designed this API to support “one-off” and “multi-shot” conversations. Users can have “one-off” question answering interactions, where the prior context is not included, via a search pipeline that uses this new question answering processor. Users can also have “multi-shot” conversations, where interactions are stored in conversational memory and are used as additional context sent to the model along with each search query. Users will need to use the Conversational Search plugin to create a conversation and pass the conversationId to the search pipeline in order to retain all the interactions associated with it. In addition to the conversation ID, users can also pass a “prompt” parameter for any prompt engineering alongside their search query.

```
GET wiki-simple-paras/_search?search_pipeline=convo_qa_pipeline
{
  "_source": ["title", "text"],
  "query": {
    "neural": {
      "text_vector": {
        "query_text": "When was Abraham Lincoln born?",
        "k": 10,
        "model_id": ""
      }
    }
  },
  "ext": {
    "question_answering_parameters": {
      "question": "When was Abraham Lincoln born?"
    },
    "conversation": {
      "id": "...",
      "prompt": "..."
    }
  }
}
```

The search pipeline includes pre- and post-processing steps. The pre-processing step uses generative AI to rewrite the search query submitted by the user, taking into account the conversation history if a conversation was specified. This allows things like antecedent replacement (”When was he born?” → “When was Abraham Lincoln born?”, if the prior question was “Who was Abraham Lincoln?”). The post-processing step is a processor that takes the search results, optionally performs a lookup against the conversational memory, and then sends this data to the LLM configured by the user. We believe different users will want to use different LLMs, so this will be pluggable.
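For illustration, such a pipeline could be defined roughly as follows. This is a sketch only: the processor names and parameters shown here are hypothetical placeholders we have not settled on.

```
# Sketch only - processor names and parameters are hypothetical placeholders.
PUT /_search/pipeline/convo_qa_pipeline
{
  "request_processors": [
    {
      "conversational_query_rewrite": {
        "model_id": "<llm-model-id>"
      }
    }
  ],
  "response_processors": [
    {
      "retrieval_augmented_generation": {
        "model_id": "<llm-model-id>",
        "context_field_list": ["text"]
      }
    }
  ]
}
```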
### Conversation API

The point of this API is to provide conversational search as a relatively simple endpoint, hooking pieces together such that the user can easily build an application with it. It takes a search query (or some other kind of human input), performs a search against OpenSearch, and then feeds those search results into an LLM and returns the answer. All of this work is done in the search pipeline underneath - so the API is just a wrapper - but we feel this kind of API would be helpful to developers who just want an easy REST API. We would like to return search results as well as the LLM response. This differs from most existing systems, which return only answers, and it allows clients to perform validations or additional downstream processing.

```
/// Ask a question and get a GenAI response grounded in search results
operation Query {
    input: QueryInput
    output: QueryOutput
}

structure QueryInput {
    index: String
    conversationId: ConversationId
    query: String
    prompt: String
    filter: String
    numResults: Integer
}

structure QueryOutput {
    response: String
    rewrittenQuery: String
    searchResults: DocList
    interactionId: InteractionId
}

/// List of docs used to answer the question
list DocList {
    member: Document
}
```
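As a rough sketch of what calling this endpoint might look like (the URL path below is a placeholder we have not settled on; the field names follow the `QueryInput` shape above):

```
# Hypothetical endpoint path - illustration only.
POST /_plugins/_conversational/query
{
  "index": "wiki-simple-paras",
  "conversationId": "...",
  "query": "When was Abraham Lincoln born?",
  "prompt": "Answer concisely, using only the retrieved documents.",
  "numResults": 5
}
```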
# Discussion

1. **Performance:** LLM inference takes on the order of seconds; if you have sufficiently high traffic, that can increase to minutes or more as an LLM hosting service rate-limits or a hosted model becomes resource constrained. Five people using this at the same time could have the potential to completely stall each other out. We’ll try to be fault-tolerant in this regard, but a lot of the onus may fall on users and LLM hosters to work out how to get higher LLM throughput.
2. **Ordering:** Since LLM inference can take a while, a user might get impatient and ask a bunch of search queries before the first search query has returned an answer, and the answers might come back from the LLM out of order. We will write only complete interactions, in the order that responses come back from the LLM. The client should disallow multiple queries at once (in a conversation) to prevent this.
3. **Dependencies:** This relies on the relatively new search pipeline and remote inference features. Accordingly, this probably only works for OpenSearch ≥ 2.9, with the appropriate ML-Commons installation. We’re also hoping to get the pipelines themselves into Search-Processors; in which case that plugin also becomes a dependency. Lastly, the high-level Conversational API depends on the Conversational Pipeline, and they both depend on the Conversational Memory plugin, which we think should be its own plugin. We’ll put out some resources on building once we figure it out.

# Summary

In this RFC we gave a proposal for bringing conversational search into OpenSearch. Our proposal consists of three components: 1/ an API for conversational memory stored in OpenSearch, 2/ an OpenSearch search pipeline for Retrieval-Augmented Generation (RAG), and 3/ a simple one-shot API for conversational search applications. We would appreciate any feedback, suggestions, and comments towards integrating this cleanly with the rest of the OpenSearch ecosystem and making it the best it can be. Thanks!

# Requested Feedback

- Does this feature set cover the set of use cases for generative AI applications that you want to build? We have been focused on search applications, and we’re interested in how much the community wants to go beyond exposing conversational search and conversational memory building blocks at this time.
- We believe the search pipeline is a great mechanism for defining RAG pipelines, but we also felt that a conversational API that invokes this pipeline would be helpful for developers to more easily build conversational search applications. We’d love feedback on whether we should add more to this API, or conversely whether it’s even needed to provide an easy developer experience.
- This approach for RAG introduces several cross-plugin dependencies. There has been talk in the community about moving away from the plugin architecture for OpenSearch, and we want to make sure this approach is aligned with the higher-level architectural goals of the project. We’d appreciate feedback on this topic.
davidlago commented 11 months ago

Each ConversationMetadata and Interaction will have access controls linked to the specific user that creates them. Only Alice can add to and read from conversations that Alice owns. The main rationale for this is that Alice’s conversation will potentially include information from all documents Alice has access to, so her conversations’ access controls are maximally the intersection of Alice’s access rights. We plan to leverage OpenSearch’s existing access control mechanisms for this.

There is an important nuance to this statement: her conversations’ access controls are maximally the intersection of Alice’s access rights at the time of the interaction.

If Alice's permissions change from the time of the interaction in a way that makes some of the captured information off-limits to her, this access control will no longer be appropriate.

We don't currently have the needed security primitives/functionality to support this level of access control on derived data natively (it's on our radar though!), so limiting access to the interactions to the owning user is the best we can do without them.

With that said, am I correct in interpreting that the indices that will hold the Conversations and Interactions will be restricted to just the plugin and all access to the data gated by the new API? If so, they are missing a field with the user who owns them so that we can enforce that access control.

macohen commented 11 months ago

This is a very thorough RFC. Thanks, Austin.

Dependencies: This relies on the relatively new search pipeline and remote inference features. Accordingly, this probably only works for OpenSearch ≥ 2.9, with the appropriate ML-Commons installation. We’re also hoping to get the pipelines themselves into Search-Processors; in which case that plugin also becomes a dependency. Lastly, the high-level Conversational API depends on the Conversational Pipeline, and they both depend on the Conversational Memory plugin, which we think should be its own plugin. We’ll put out some resources on building once we figure it out.

Confirming that Search Pipelines is only available in 2.9+. When you say "...hoping to get the pipelines themselves into Search-Processors..." do you mean the search-processor GH repo? That repo has two processors that we will eventually factor out into separate repos. Our current thinking on search processors is that they can be included in core (https://github.com/opensearch-project/OpenSearch) if they have no external dependencies. If there are dependencies, a separate repo as a self-install plugin is the right approach. Some of this may belong in ml-commons, but I would leave that up to the maintainers of this repo.

You may also need to build a search processor that is ALSO a plugin to gain access to resources via the plugin interface - for example, to access the conversation memory. One analogy for search pipelines is to think of them like piping together *NIX commands. Each command (processor, in pipeline speak) can be as complex as needed, but still really only does one thing, and you compose functionality by sending the stdin (request, in pipeline speak) or stdout (response, in pipeline speak) from one processor to the next. Some of the processors needed here may end up in core; some may end up in a separate repo.

cc: @msfroh

jngz-es commented 11 months ago

Regarding the Conversation API, it looks like it only wraps a search pipeline, and I think the search pipeline APIs already work well on their own. Meanwhile, I think the conversation function can support not only search applications but also other applications like chatbots. That means the conversation function can help users build any conversational application.

jngz-es commented 11 months ago

Regarding performance, we should consider not only traffic (for multiple users) but also latency (for a single user). The search experience is latency sensitive; if we introduce LLM interactions in pre-processors and post-processors, the latency could become unacceptable. We should reduce the number of rounds of interaction with the LLM to improve latency and save cost, as LLM API calls are expensive.

jngz-es commented 11 months ago

Actually, we also have an RFC about a conversation plugin in OpenSearch to support building conversational applications.

HenryL27 commented 11 months ago

Thanks @davidlago.

I agree that limiting access to the interactions to the owning user is the best we can do currently. We would love to collaborate on building the necessary security primitives to support access control on derived data. Please keep us informed on any future RFC on this.

Yes, we're planning on restricting access to the conversational memory indices to the plugin / API. We'll be keeping track of the user under the covers in the index.

austintlee commented 11 months ago

@macohen Thanks for the clarification and suggestions.

austintlee commented 11 months ago

@jngz-es We agree that latency is important, and we will certainly look for ways to reduce unnecessary round-trips. That said, we have seen good results in some cases by using an LLM to both rewrite the query and summarize the response. We believe that some users will be okay with the additional latency for better results, and we want that to at least be an option.

macohen commented 11 months ago

In a conversational search application, I also think users may expect some latency compared to a keyword search. Think time for conversations is acceptable in general, right?

jonfritz commented 11 months ago

Thanks to the folks who have responded to the RFC we posted a couple days ago for Conversations and Generative AI in OpenSearch. @jngz-es - I noticed that you recently posted an RFC on the same topic (#1151). I'm concerned that the overlap will cause confusion in the community and make it difficult to align our development.

We would love to find a process where we can work together. The process that I’m used to in open source communities is to start with one RFC and then iterate and add feedback rather than creating multiple RFCs. This process has some benefits - it drives alignment in the open, enables the community to share and iterate on ideas, and makes the end product easy to understand and use.

My suggestion is that we adopt this approach to work together on the RFC for conversational features in OpenSearch. We greatly appreciate the feedback you've already given this original RFC, and we'd be happy to do the work to update this RFC and continue to iterate to incorporate any other technical suggestions you have. Let us know what you think! We are excited to find ways to work together to make OpenSearch the best platform for building conversational applications.

dblock commented 11 months ago

Love seeing multiple proposals for similar outcomes! Personally I don't think there's anything wrong with two competing implementations that potentially converge into the best in class. Without diving too much into details, @austintlee and @jngz-es, what are the similarities and differences between the two proposals? What do you think is better in the one you didn't write?

dylan-tong-aws commented 11 months ago

Hi Austin,

I love how you intuitively architected your application with the components in the intended way, using the new building blocks like search pipelines, AI connectors, and vector database capabilities. I had expected that we would need to document this better.

With that said, we are working on the next iteration of the framework to simplify and improve the developer experience. Some concepts that we're considering:

  1. We'd like to introduce the notion of use case templates. Imagine a single declarative interface to describe a use case like semantic search or RAG, with prescriptive default configurations for search pipelines (e.g. RAG), prompt engineering routines, and AI service connectors, which developers can selectively reconfigure.

  2. We're exploring extending the idea in (1) to provide a no-code interface like LangFlow or Flowise, but scoped to OpenSearch-powered AI apps. You'll have the option of a no-code interface to configure and prime OpenSearch for your specific use case by modifying or generating your use case template.

What are your thoughts?

austintlee commented 11 months ago

@macohen @msfroh Do you have any suggestions for how we might return answers generated by LLMs in the SearchResponse?

I think there are largely three approaches.

1/ The most "intrusive" approach would be to introduce a new field in the SearchResponse, e.g.

{
  "conversation": {
    "id": "...",
    "answer": "..."
   },
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3205,
      "relation": "eq"
    },
    "max_score": 3.641852,
    "hits": [
      {
        ...
       }
    ]
  }
}

2/ We can return it as one of the SearchHits by inserting the answer into the Hits array in the response processor (which means we would need to reconstruct the response object on the way out).
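For instance, the injected hit might look something like the following (purely illustrative; the layout of such a synthetic hit is not decided):

```
# Purely illustrative - the shape of a synthetic "answer" hit is not decided.
{
  "_index": "wiki-simple-paras",
  "_id": "conversation-answer",
  "_score": null,
  "_source": {
    "conversation_id": "...",
    "answer": "..."
  }
}
```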

3/ A middle ground would be an extension ("ext") to the response that can be customized by Search Pipelines:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3205,
      "relation": "eq"
    },
    "max_score": 3.641852,
    "hits": [
      {
        ...
       }
    ]
  },
  "ext": {
    "conversation": {
      "id": "...",
      "answer": "..."
     },
  }
}

Would this option be made possible as part of perhaps this work - https://github.com/opensearch-project/OpenSearch/issues/8635?

msfroh commented 11 months ago

I really like the proposal and have a few questions / comments, mostly around the conversation memory:

Data store

we will use an OpenSearch index for it

Is this a hard requirement? It does feel like the most obvious place for it (since we're already running on OpenSearch, it adds no additional dependencies), but maybe someone might benefit from some other data store? Each conversation is an append-only log, if I'm understanding correctly, so another data store might be a good fit. (Of course, I hear that a lot of people like storing their append-only logs in OpenSearch indices, so maybe it really is the best option.)

Metadata

structure ConversationMetadata {
    conversationId: ConversationId
    numInteractions: Integer
    createTime: Timestamp
    lastInteractionTime: Timestamp
    name: String
}

If numInteractions and lastInteractionTime are left out of the explicit schema of the persisted entity, then ConversationMetadata is immutable, which is nice. I'll kind of contradict my comment above and say that they're "pretty cheap" to compute on the fly if the interactions are stored in an index. Maybe computing those fields dynamically at read time was mentioned in the RFC and I missed it -- I still have some brain fog from jet lag after vacation.
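For example, assuming the interactions are stored in an index with fields along the lines of the RFC's `Interaction` shape (the index and field names here are just my guesses), both values could be computed with a single aggregation query:

```
# Sketch - index and field names are guesses, not the RFC's actual schema.
GET /.conversational-memory-interactions/_search
{
  "size": 0,
  "query": { "term": { "conversation_id": "<conversation-id>" } },
  "aggs": {
    "numInteractions": { "value_count": { "field": "interaction_id" } },
    "lastInteractionTime": { "max": { "field": "time" } }
  }
}
```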

Other uses?

At the risk of opening a can of worms, I'm wondering if such a proposal could help for other "session-based" search refinements. I'm thinking of an e-commerce application where someone searches for "black shoes", doesn't click on any search results, and then searches for "nike basketball shoes" -- you may want to rank the black shoes higher, on the assumption that the two queries are related.

If you include a user identifier in the conversation metadata, the system could provide a more personalized experience based on prior conversations with the user (subject to the usual privacy concerns where you would need to let the user delete some or all conversations). Probably out of scope, though.

There's been some discussion around interaction logging, incorporating the user's "post-search" actions (see https://github.com/opensearch-project/OpenSearch/issues/4619), which feels like it overlaps a bit, though a "one size fits all" solution probably wouldn't be ideal. Still, I'm wondering if there's some opportunity for reuse or at least sharing lessons learned.

jngz-es commented 11 months ago

Comparing with RFC-1151, the common part is a new plugin to store chat history.

The differences from RFC-1151 are:

  1. Having generic conversation APIs to support building all kinds of conversational applications, like conversational search, chatbots, etc.
  2. Using ReAct once to get results, so the latency can be more under control; the search pipelines can be used as a tool in ReAct.

Basically, I don't see major conflicts between these two RFCs from the implementation perspective; we can have both. We can have a new conversation plugin to store chat history and, meanwhile, provide a chat API for applications. We can also have a new ML processor to run conversation/ml-commons APIs in search pipelines. Users can build conversational search either way.

jonfritz commented 11 months ago

@jngz-es thanks for sharing this, and excited to get more into the details in the GenAI meeting on Friday. I would encourage the community to have one way to build a conversational application, unless we saw a true need to have multiple approaches. It'll make the developer experience easier to learn for users interested in building applications.

From proposal #1151, it seems like perhaps it could be split (and renamed) to make the idea more crisp. For items that relate to building conversational search, we can use the comments on #1150 and iterate on that RFC to create the approach. It seems like the big, net new question in #1151 is whether OpenSearch should add the ability to create multi-agent architectures (in a similar direction to what LangChain does). I think this warrants a deeper discussion, as I wonder whether OpenSearch should be trying to incorporate this versus having customers do this in their application stack, letting OpenSearch focus on a different set of primitives. By repurposing #1151 (and renaming it to channel this theme), I think we'd be able to more crisply outline each area and theme in the RFCs. Thoughts?

jngz-es commented 11 months ago

@jonfritz I agree we should have one way to build a conversational application. I believe conversational search is one of them. It looks like #1150 is specific to conversational search; what about other applications like chatbots? If customers want to build chatbots on OpenSearch, should we provide another framework to support that? I don't think so, as we should have one way to build conversational applications. What do you think?

jonfritz commented 11 months ago

@jngz-es clarifying question - how do you define a "chatbot", and how is that different from a conversational search interaction? From a customer perspective, I see customers wanting a natural language way to interact with their data stored in OpenSearch and to leverage the generative aspects of LLMs to enrich and summarize those interactions and better understand the search query submitted (e.g. rewrites). We use the term "conversational search" to describe this, and a customer application could be considered a "chatbot" because it's a conversation with a natural language application. What use cases for natural language/chat interactions do you think would make sense for OpenSearch outside of this pattern?

HenryL27 commented 11 months ago

Comparing with #1151, another thing we'd like to have in common: prompt template management. Pretty much every conversational application will need some kind of prompt engineering, and this presents a good way to manage that at scale, so we'd love to incorporate some version of that into #1150.

I'll flesh out what I'm imagining in a little more detail than I think either RFC gives.

  1. Prompt templates are essentially just f-strings, so let's not overcomplicate this.
  2. Template lifecycle: first register the template like it's a model. Then various components (pipeline, ml-predict) will invoke them. The invocations are specified in either the configuration or the parameters of the components. Templates can be updated, since prompt engineering (being more art than science) should be highly iterable.
  3. The prompt invocation may be hidden from the user, so the user must know what placeholders to include. We can probably just publish this (or borrow an existing protocol if one exists).

example template: "Summarize this list of documents from opensearch: {doc_list}"
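To make the lifecycle concrete, registration and invocation could look roughly like this (the endpoints and body fields are invented for illustration; none of this exists today):

```
# Hypothetical endpoints - illustration only, not an existing API.
POST /_plugins/_ml/prompt_templates/_register
{
  "name": "doc_summarizer",
  "template": "Summarize this list of documents from opensearch: {doc_list}"
}

# Invocation: a component references the template and fills the placeholders.
POST /_plugins/_ml/models/<model_id>/_predict
{
  "prompt_template": "doc_summarizer",
  "parameters": { "doc_list": "..." }
}
```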

Am I missing anything here?

jngz-es commented 11 months ago

@jonfritz the use case I imagine is an e-commerce customer using OpenSearch who wants to build a chatbot for their own customers, to improve the customer experience on their e-commerce platform. It would be easy for an OpenSearch user to build a chatbot if we could support conversation-based application building.

jngz-es commented 11 months ago

@HenryL27 I agree. Actually, whether we support only conversational search or other conversational applications as well, we probably need something similar to LangChain as a framework for building conversational applications, including conversational search. So from the implementation perspective, I don't see major conflicts.

jonfritz commented 11 months ago

@jngz-es interesting idea. I'm more interested in the specifics of how you see this chatbot being different from a conversational search interaction, though. Can you share a more detailed vision of what an eCommerce chatbot would do (e.g. what questions or commands it would respond to, and with what information)? FWIW - for me, it feels like a general chatbot application platform is outside the scope of how most customers would want to use OpenSearch. An arbitrary chat application (e.g. one that generates poetry) that's decoupled from the core OpenSearch purpose (accessing unstructured data) may be best suited for a different application stack. On the other hand, conversational search is more closely tied to OpenSearch, because it's a different way for customers to interact with their data on the platform (through natural language search queries). I'd love to learn more about what your customers are asking for (and get into the details of a "chatbot"), and whether they do want to build these types of apps in OpenSearch versus other methods - it'll be a good discussion for Friday's meeting.

jngz-es commented 11 months ago

@jonfritz a chatbot could not only provide conversational search results but also improve the entire shopping experience from different perspectives. On top of search results, customers could have questions about product comparisons, coupon combination recommendations, product bundle discounts, return/refund policies, etc. - basically anything based on specific knowledge stored in OpenSearch.

hijakk commented 11 months ago

@austintlee - following up on the conversation from the Zoom session on July 28 - I have an image search use case where it would be really useful to send a base64-encoded image to a search pipeline. That pipeline would reach out to an external service to vectorize the image and leverage the resulting vector in a k-NN search. I could see this being useful in ingest pipelines as well, but the resource usage associated with this could be pretty intense in the image-specific use case.

Supporting image inputs for conversational/generative AI would be extremely powerful in general, so I'm hopeful that'd be included in future solutions.

ylwu-amzn commented 11 months ago

@hijakk, this is a good suggestion. I created an issue, https://github.com/opensearch-project/ml-commons/issues/1163 - feel free to discuss multi-modal model support on that issue.

HenryL27 commented 11 months ago

#1161 and #1150 represent rather different philosophies in terms of integrating GenAI into OpenSearch. #1161 seeks to provide a framework for building conversational apps that happen to use some OpenSearch features. #1150 seeks to provide a conversational interface over your favorite search engine. Both are valid. Neither should be a core OpenSearch feature, and furthermore, I think that neither belongs in ML-Commons. ML-Commons is for training and predicting with ML models, while these RFCs are for building Generative AI applications.

Accordingly, I’d like to call for the creation of an ‘AI-Commons’ plugin as an extension to ML-Commons. #1150 and #1161 will look pretty similar, code-wise, so I imagine it should be pretty easy to share a codebase. Both need conversational memory of some form; both need prompt templating of some form.

Why do we want both? I imagine developers picking AI-Commons up for the RAG of #1150 - this will provide a good starting point for people looking to spice up their existing search application with some GenAI pizzazz. In many use-cases, this will be sufficient. But gradually, these conversational search apps will acquire peculiarities and requirements that the RAG pipeline might not support. Then #1161’s CoT will be required, and these apps will cross a line where they stop being fancy search apps and start being fancy AI apps. Therefore it should be easy to go from RAG to CoT - RAG should be in the CoT ecosystem, but should also be able to stand alone.

As an example, what does answering the question “What happens if a ship jackknifes in the Suez Canal?” entail? RAG will try to answer in a single query (granted, with some potentially clever query rewriting), but unless there’s a document detailing the answer to this question, RAG is hopeless. CoT, however, will ask a series of queries, one step at a time, to build up and derive an answer. For example - “What are the major trade routes through the Suez Canal?”, “What are the shipping routes from Oman to America?”, “How long are they?”, “What products does this particular ship carry?”, “What is the demand and backlog of this particular product?”, etc.

Great! Well, if CoT is so powerful, why are we bothering with RAG? A couple of reasons. 1/ RAG is much simpler. Personally, I prefer using predictable tools that I understand. I know exactly what RAG is going to do: query OpenSearch, then pipe the results into an LLM for synthesis. I don’t know what a CoT agent is going to do, given an arbitrary query. That’s what makes it so powerful - it gets to choose how to answer - but I don’t quite trust it to do the right thing. And trust is everything when it comes to GenAI adoption. So if we let RAG build up trust in the system, then people will be more comfortable switching to CoT. 2/ RAG is closer to search. OpenSearch users want to do search. Throwing a whole CoT GenAI infrastructure at someone who just wants to do search is going to alienate them. But a GenAI interface (RAG) over their search, maybe that will be easier to stomach. Finally, 3/ RAG is probably cheaper, cost- and performance-wise - only one or two LLM inferences instead of several for every query.

So both #1150 and #1161 should happen, imo, and in the same place. How can we combine our efforts? The absolute first thing is that we all need to be aware of what code already exists. I’ve published some code, and I would urge everyone else here to do the same. We don’t all work together, so if we want to work together, we need to be able to see each other’s work. As far as integrating the RFCs into one project - first I’m gonna vote to separate agents from models (the first option from #1161). Then my proposed plan:

  1. 1150 will proceed as planned - conversational memory, RAG pipelines and API - in AI-Commons (or whatever we call the conversational AI plugin)

  2. Everyone seems to agree on prompt templates; some implementation of that will exist.
  3. 1161 CoT will continue as planned, also in AI-Commons, with a RAG tool that uses the RAG pipeline

In general with CoT, I think we don’t want to give the LLM too many options. We should try to keep as much complexity within the tools as possible, and focus on giving them clean and intuitive interfaces - LLMs are basically pure intuition.

I hope this plan is agreeable to people.

p.s. can we resolve #1151? It looks like #1161 and #1150 partition it.

sean-zheng-amazon commented 11 months ago

Really like the proposal from @HenryL27. We can keep #1150 focused on the conversation plugin and RAG pipeline, and use #1161 to track the agent framework. We've closed #1151 and will split its content into this and #1161.

Meanwhile a couple of follow up questions:

asfoorial commented 11 months ago

Nice features mentioned in this RFC. I suggest keeping the door open for LLM hosting, as there is a trend toward making LLMs smaller with quantization while still achieving reasonable performance. I would say they will be hostable on ML nodes or other dedicated nodes.

HenryL27 commented 11 months ago

@asfoorial thanks! We're not intending to preclude LLM hosting inside OpenSearch - it's just not a goal of this RFC. Basically, any LLM (can you call it an LLM if it's smaller? maybe just seq2seq?) that the ML-Commons model framework allows use of should work with this.

austintlee commented 11 months ago

What does the community think about putting the work that we are proposing in this RFC - the Conversation and Memory APIs and the search processors - in ml-commons?

asfoorial commented 11 months ago

preclude

@HenryL27 Yes they are called LLMs. Examples are listed here https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main for Llama 2. Some are relatively small in size and resource consumption and yet they perform well. GPT4ALL also has a pretty good collection that keeps growing, thanks to quantization.

Also, the GPT4ALL Java binding here https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/java would make it easily possible to host quantized LLMs right from within OpenSearch ML nodes, even on CPUs, with reasonable performance. They have a pretty good Falcon model, which is ~4GB and consumes 8GB of RAM, and other smaller models that would consume 4GB of RAM.

I am looking forward to seeing OpenSearch thrive with all these latest advancements.

ylwu-amzn commented 11 months ago

What does the community think about putting the work that we are proposing in this RFC - the Conversation and Memory APIs and the search processors - in ml-commons?

Hi @austintlee, we are working on creating a new repo for this. Will share it here when it's ready.

mashah commented 11 months ago

Yailing @ylwu-amzn,

Based on the discussion above, it seems that we should be putting this work in ml-commons. We are working in that direction.

sean-zheng-amazon commented 11 months ago

@austintlee / @mashah / @ylwu-amzn, just want to make sure we are on the same page:

mashah commented 11 months ago

@sean-zheng-amazon

The conversation history API is meant to be more general than RAG. RAG is a good start, but there's a lot more one can do, especially with AI. Our recommendation is to include conversation history in ml-commons. We'd like it to be used with what is being proposed in ml-commons.

HenryL27 commented 11 months ago

Can someone on the AWS side take a look at my CRUD Conversational Memory implementation? I want to be able to get it out quickly since a lot of things will depend on it: aryn-ai/conversational-opensearch. Still working on the access controls.

asfoorial commented 11 months ago

Adding to the above, I would like to see OpenSearch capable of running all its features independently and offline, without the need for external systems, yet still able to integrate with them. More specifically, having a default set of LLMs running inside ml-commons, just like the sbert models.

mashah commented 11 months ago

@asfoorial +1

That's what we're going for.

ylwu-amzn commented 11 months ago

Can someone on the AWS side take a look at my CRUD Conversational Memory implementation? I want to be able to get it out quickly since a lot of things will depend on it: aryn-ai/conversational-opensearch. Still working on the access controls.

@jngz-es can you help take a look

austintlee commented 11 months ago

@sean-zheng-amazon I think it's important that we have a consistent way of accessing Conversational Memory (CM) and we believe agents created through the proposed Agent Framework will need to integrate with CM. That's why we think CM belongs in ml-commons.

ylwu-amzn commented 11 months ago

I think it would be a clean way to build a conversation plugin to manage the conversation-related data. That would be something reusable for other components, not just ml-commons.

mashah commented 11 months ago

@ylwu-amzn @sean-zheng-amazon

Allow me to repeat in my words: you want to separate stateful features like conversational history and memory from stateless code like the Agent Framework.

We are suggesting keeping them together in ml-commons so: 1/ we don't duplicate functionality in ml-commons and in the separate plug-in and 2/ users have a clean experience -- they don't need to download separate pieces to make it work.

Is the area that you're creating for conversational history and memory a part of the default bundle with ml-commons? If so, then at least we mitigate #2, and we also get what you want, which is separation of concerns in the code base.

sean-zheng-amazon commented 11 months ago

Here are my 2 cents:

@mashah I figured it's probably better to have a meeting to discuss this and update here, thoughts?

elfisher commented 11 months ago

@ylwu-amzn @sean-zheng-amazon

Allow me to repeat in my words: you want to separate stateful features like conversational history and memory from stateless code like the Agent Framework.

We are suggesting keeping them together in ml-commons so: 1/ we don't duplicate functionality in ml-commons and in the separate plug-in and 2/ users have a clean experience -- they don't need to download separate pieces to make it work.

Is the area that you're creating for conversational history and memory a part of the default bundle with ml-commons? If so, then at least we mitigate #2, and we also get what you want, which is separation of concerns in the code base.

I also agree with these points. We should try to avoid capability duplication and ensure that, if we are building something everyone can benefit from, it is included in the default bundle.

ylwu-amzn commented 11 months ago

Thanks Mehul, I think these are valid points. These are my thoughts:

  1. We will design a general memory layer in ml-commons, which defines the interface of memory, e.g. save, retrieve, etc.
  2. Memory could have different kinds of implementations: conversation history could be a form of long-term memory, and we may also allow users to build memory in other ways, like Redis or DDB. Separating these things could make the architecture extensible, though people may be concerned that this is over-engineering; we may leave it to a later phase. To me, this has the benefit of giving users more flexibility.

elfisher commented 11 months ago

It sounds like we are addressing the first issue. Can we clarify whether we can include this in the default project release bundle? It seems like it will be useful for many users.

jonfritz commented 11 months ago

From the notes in this thread, it seems like the conversational memory described in this RFC should be added to ml-commons. If the aim for ml-commons is to provide these building blocks and evolve them over time with more features, it's not clear why we would start in a plugin elsewhere. We could always pull it out later into a separate plugin if the community decides to change perspective and minimize the surface area of ml-commons at a future date. But, given the chatter on this RFC and Slack, it seems like we should start by adding this code to ml-commons. Are we aligned here?

elfisher commented 11 months ago

From the notes in this thread, it seems like the conversational memory described in this RFC should be added to ml-commons. If the aim for ml-commons is to provide these building blocks and evolve them over time with more features, it's not clear why we would start in a plugin elsewhere. We could always pull it out later into a separate plugin if the community decides to change perspective and minimize the surface area of ml-commons at a future date. But, given the chatter on this RFC and Slack, it seems like we should start by adding this code to ml-commons. Are we aligned here?

That makes sense to me. I don't think we want to implement memory management twice.

austintlee commented 11 months ago

@ylwu-amzn A "general memory layer" as you put it is precisely what we are proposing in this RFC (it's one of key components of conversations). If you think it should be in ml-commons, let's do that. If you want to augment the current proposal to include room for other storage types, we can do that here.

ylwu-amzn commented 11 months ago

@ylwu-amzn A "general memory layer" as you put it is precisely what we are proposing in this RFC (it's one of key components of conversations). If you think it should be in ml-commons, let's do that. If you want to augment the current proposal to include room for other storage types, we can do that here.

I mean not just ml-commons; other components also need this conversation/memory layer. ml-commons can use it, but it's not necessary to build the whole layer into ml-commons. I think it can be reused by other components/plugins too, so keeping it as a separate plugin can make the architecture clearer. For example, a plugin like Alerting may need a memory layer too, but it doesn't need ML. Why should it have to add ml-commons as a dependency? It can just depend on the dedicated conversation/memory plugin.