opensearch-project / opensearch-py

Python Client for OpenSearch
https://opensearch.org/docs/latest/clients/python/
Apache License 2.0

[PROPOSAL] Add Cohere LLM functionality #659

Closed tianjing-li closed 7 months ago

tianjing-li commented 8 months ago

What/Why

What are you proposing?

In a few sentences, describe the feature and its core capabilities.

Integrate Cohere generative AI capabilities into the OpenSearch Python client. This will involve work over several PRs; the aim of this issue is mainly to highlight, at a very high level, the overall work that would be required.

The integration would cover the following:

What users have asked for this feature?

Highlight any research, proposals, requests or anecdotes that signal this is the right thing to build. Include links to GitHub issues, forums, Stack Overflow, Twitter, etc.

Cohere has an existing partnership with OpenSearch, and there are existing guides for integrating OpenSearch and Cohere by manually registering a connector and enabling semantic search with Cohere.

This integration would aim to make it easier to use generative AI features.

What problems are you trying to solve?

Summarize the core use cases and user problems and needs you are trying to solve. Describe the most important user needs, pain points and jobs as expressed by the user asks above. Template: When [a situation arises], a [type of user] wants to [do something], so they can [achieve an outcome]. (Example: When searching by postal code, a buyer wants to be required to enter a valid code so they don't waste time searching for a clearly invalid postal code.)

To allow OpenSearch Python users to integrate generative AI with their instances in a more seamless manner.

What is the developer experience going to be?

Does this have a REST API? If so, please describe the API and any impact it may have on existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI, or configuration changes that are planned as part of this feature.

Ideally, classes such as CohereEmbed would be created within the SDK and could then be imported from it; the user would call the methods they require. These classes/sets of functionality would be grouped within a submodule, as in the sketch below.
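As a rough sketch (not an existing opensearch-py API): the CohereEmbed class name comes from this proposal, while the wrapper around the existing cohere SDK and the indexing call are purely illustrative.

```python
# Hypothetical CohereEmbed helper as it might live in a proposed submodule.
# Only the cohere SDK and opensearch-py calls are real; the class itself is
# illustrative and does not exist in opensearch-py.
import cohere
from opensearchpy import OpenSearch


class CohereEmbed:
    """Sketch of a wrapper exposing Cohere embeddings to opensearch-py users."""

    def __init__(self, api_key: str, model: str = "embed-english-v2.0"):
        self._co = cohere.Client(api_key)
        self._model = model

    def embed(self, texts):
        # cohere's embed call returns an object with an `embeddings` list.
        return self._co.embed(texts=texts, model=self._model).embeddings


# Example usage: embed a document and store the vector alongside the text.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
embedder = CohereEmbed(api_key="<COHERE_API_KEY>")
vector = embedder.embed(["OpenSearch is a search and analytics suite."])[0]
client.index(index="docs", body={"text": "...", "embedding": vector})
```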

If adding REST APIs, we would add the following:

Are there any security considerations?

Describe if the feature has any security considerations or impact. What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here.

API Key: Instantiating Cohere's client requires an API key that the end user will have to manage.

User data: From Cohere's dashboard, an admin user (an admin is required to create an API key, so you will have one by default) can opt out of data collection, so that none of the data sent to our APIs is stored. Here is our formal data use policy.
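As an illustration of the API-key consideration, one common way for end users to manage the key is via an environment variable (the variable name below is just a convention, not something mandated by either project):

```python
import os

import cohere

# Read the key from the environment rather than hard-coding it in source control.
api_key = os.environ["COHERE_API_KEY"]
co = cohere.Client(api_key)
```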

Are there any breaking changes to the API

If this feature will require breaking changes to any APIs, outline what those are and why they are needed. What is the path to minimizing impact? (For example, add a new API and deprecate the old one.)

No

What is the user experience going to be?

Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature via CLI, OpenSearch Dashboards, REST API, etc. Using a bulleted list or simple diagrams to outline features is okay. If this is net new functionality, call this out as well.

Open to feedback from more experienced contributors on how the integration would best be used. To start, a submodule that the user can import.

Are there breaking changes to the User Experience?

Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?

No

Why should it be built? Any reason not to?

Describe the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. Highlight opportunities for additional research.

The OpenSearch community would greatly benefit from this proposed integration with Cohere's platform, bringing modern LLM capabilities to the table. It would allow OpenSearch end users to leverage generative AI and integrate it with their tools more easily.

What will it take to execute?

Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

I would need a deeper understanding, especially from the existing community, of the impact of my proposed changes. I will put my questions in the next section; from there, the overall scope and work required should become clearer.

Any remaining open questions?

What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation.

dblock commented 8 months ago

Is Cohere implemented as an OpenSearch plugin (it doesn't seem that way?), and what RESTful endpoints does it expose in OpenSearch (it doesn't seem to expose any)? If it's an AWS feature then I think it could be added into a new .aws. namespace or something like that, but we need to make sure it's loaded optionally. I am also not sure what the advantage is of having a specialized client.

dblock commented 8 months ago

Related to testing functionality that's not available in docker/open source, we have https://github.com/opensearch-project/opensearch-py/issues/382 open, but I would say we will need to have at least some level of testing that can be done offline. This would be a good time to introduce something like VCR.
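As a sketch of what that could look like with the vcrpy package (the test name, cassette path, and model are illustrative; the vcr calls themselves are the library's documented API):

```python
# Offline test sketch using vcrpy: the first run records the HTTP exchange to
# a cassette file; subsequent runs replay it with no network access required.
import vcr

recorder = vcr.VCR(
    cassette_library_dir="tests/fixtures/cassettes",
    filter_headers=["authorization"],  # keep real API keys out of recorded fixtures
)


@recorder.use_cassette("cohere_embed.yaml")
def test_embed_roundtrip():
    import cohere

    co = cohere.Client("dummy-key-for-replay")
    response = co.embed(texts=["hello"], model="embed-english-v2.0")
    assert len(response.embeddings) == 1
```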

tianjing-li commented 8 months ago

To clarify, Cohere isn't built on top of OpenSearch; we have our own API infrastructure. The main advantage of adding an integrated client into the OpenSearch SDK would be to remove friction in setting up LLM capabilities if you have an OpenSearch instance. Would this make sense to be added in the /plugins/ module?

For testing, agreed: it would be good to be able to write tests that can be run offline. I'll look into VCR.

dblock commented 8 months ago

Would this make sense to be added in the /plugins/ module?

If it's not a plugin, then likely not.

I feel like Cohere + OpenSearch is the same story as LangChain + OpenSearch, with OpenSearch being nothing more than the vector database component of an end-to-end solution. In that case, does it make more sense to instead add generic support for OpenSearch (or another vector database) to the Cohere API, and use opensearch-py there?

Another option is opensearch-py-cohere, a new component built on top of opensearch-py. That has lots of advantages, like your own release cycle and compatibility with many versions. We have https://github.com/opensearch-project/opensearch-py-ml, which is basically that, but we have also opened https://github.com/opensearch-project/opensearch-py-ml/issues/372 ;)

I definitely think we should all work backwards from what users want. So let's keep that in mind.

tianjing-li commented 8 months ago

Those are good questions and considerations. Let me chat with the team and get back to you with an informed answer.

tianjing-li commented 8 months ago

We do have existing support for OpenSearch: Cohere offers connectors that can be deployed for different data providers, including OpenSearch.

Regarding the opensearch-py-cohere suggestion, just to make sure I understand correctly: this would not involve forking the existing opensearch-py repository, but would instead be a standalone repo (would it live in opensearch-project?) that contains all the added functionality and that users would then install through pip. Is that correct?

dblock commented 8 months ago

We do have existing support for OpenSearch: Cohere offers connectors that can be deployed for different data providers, including OpenSearch.

I encourage you to think in terms of "what absolutely minimal code does a developer want to write to interact with OpenSearch + Cohere?". And we can easily work backwards from there to figure out where the code needs to live.
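To make that concrete, here is roughly what the glue code looks like today with the two clients used side by side (all calls below are the existing public APIs of the cohere and opensearch-py packages; an index with a knn_vector field named embedding is assumed to already exist):

```python
# Today's "minimal" workflow: embed a query with Cohere, then run a k-NN
# search against an OpenSearch index whose `embedding` field is a knn_vector.
import cohere
from opensearchpy import OpenSearch

co = cohere.Client("<COHERE_API_KEY>")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_vector = co.embed(
    texts=["how do I reset my password?"], model="embed-english-v2.0"
).embeddings[0]

results = client.search(
    index="docs",
    body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 3}}},
    },
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["text"])
```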

Regarding the opensearch-py-cohere suggestion, just to make sure I understand correctly: this would not involve forking the existing opensearch-py repository, but would instead be a standalone repo (would it live in opensearch-project?) that contains all the added functionality and that users would then install through pip. Is that correct?

Yes, correct.

tianjing-li commented 8 months ago

Agreed; I believe the less friction involved, the better.

The standalone add-on client sounds like a viable solution to me. How would we proceed with creating the repo under the opensearch-project directory? And would there need to be some review process by other OpenSearch contributors, or would it be essentially self-managed by its authors?

dblock commented 8 months ago

The standalone add-on client sounds like a viable solution to me. How would we proceed with creating the repo under the opensearch-project directory?

Maybe start on your own GitHub org? If you have something viable we can easily move it into the opensearch-project org before/after the first release; that requires a bit of process right now (we've done it with a handful of repos) and I'd rather not block you.

And would there need to be some review process by other OpenSearch contributors, or would it be essentially self-managed by its authors?

Everything in opensearch-project will have to follow the rules in https://github.com/opensearch-project/.github, which is especially important for security-related incidents. But it can be 100% managed by maintainers, and original repo authors preserve admin rights on the repo.

tianjing-li commented 8 months ago

Thank you, Daniel, for your input. We've tested some existing functionality within the ml-commons repository to integrate Cohere and it works, but it is quite convoluted, so we will go the add-on client route in our own org.

tianjing-li commented 7 months ago

@dblock Closing this. For now, we've decided to update the ml-commons connectors to support Cohere models.
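For anyone landing on this issue later, the ml-commons connector route can be driven from opensearch-py by calling the plugin's REST API directly. The sketch below is modelled on the published Cohere connector blueprint for ml-commons; the payload fields are abbreviated and should be checked against the ml-commons documentation for your version.

```python
# Sketch: registering a Cohere embedding connector through the ml-commons
# plugin (/_plugins/_ml/connectors/_create) using opensearch-py's generic
# transport. Payload fields are trimmed from the Cohere connector blueprint
# and may need adjusting for your ml-commons version.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

connector_body = {
    "name": "Cohere embed connector",
    "description": "Calls the Cohere /v1/embed endpoint",
    "version": "1",
    "protocol": "http",
    "credential": {"cohere_key": "<COHERE_API_KEY>"},
    "parameters": {"model": "embed-english-v2.0"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://api.cohere.ai/v1/embed",
            "headers": {"Authorization": "Bearer ${credential.cohere_key}"},
            "request_body": '{ "texts": ${parameters.texts}, "model": "${parameters.model}" }',
        }
    ],
}

response = client.transport.perform_request(
    "POST", "/_plugins/_ml/connectors/_create", body=connector_body
)
print(response)  # includes the connector_id used when registering a remote model
```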