irthomasthomas / undecidability


cohere-ai/quick-start-connectors: code for integrating workplace datastores with Cohere's LLMs to perform RAG #664

Open irthomasthomas opened 9 months ago

irthomasthomas commented 9 months ago

TITLE

cohere-ai/quick-start-connectors: This open-source repository offers reference code for integrating workplace datastores with Cohere's LLMs, enabling developers and businesses to perform seamless retrieval-augmented generation (RAG) on their own data.

DESCRIPTION

Quick Start Connectors


Overview

Cohere's Build-Your-Own-Connector framework allows you to connect Cohere's Command LLM, via the Chat API endpoint, to any datastore or software that holds text information and exposes a corresponding search endpoint in its API. This allows the Command model to generate responses to user queries that are grounded in proprietary information.

Some examples of the use-cases you can enable with this framework:

This open-source repository contains code that will help you get started integrating with some of the most popular datastores. There is also an empty template connector which you can expand to use any datasource. Note that different datastores may have different requirements or limitations that need to be addressed in order to get good-quality responses. While some of the quickstart code has been enhanced to address these limitations, other connectors provide only the basics of the integration, and you will need to develop them further to fit your specific use-case and the underlying datastore's limitations. A minimal sketch of the connector contract appears after the docs link below.

Please read more about our connectors framework here: https://docs.cohere.com/docs/connectors
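As a rough illustration of that contract: each connector is a small web service exposing a search endpoint that Cohere's Chat API calls at generation time. Below is a minimal sketch, assuming a Flask app and a hypothetical `search_my_datastore` helper standing in for your datastore's own search API; the endpoint shape follows the pattern described in the connectors docs, and the details are assumptions rather than repository code:

```python
# Minimal connector sketch (assumption, not repository code): a search endpoint
# that Cohere's Chat API can call at generation time.
from flask import Flask, request, jsonify

app = Flask(__name__)

def search_my_datastore(query: str) -> list[dict]:
    # Hypothetical stand-in: replace with a real call to your datastore's search API.
    return [{"title": "Example doc", "text": f"A passage matching: {query}", "url": "https://example.com"}]

@app.route("/search", methods=["POST"])
def search():
    body = request.get_json(force=True) or {}
    query = body.get("query", "")
    # The connector returns a JSON object with a `results` list of documents;
    # each document's text fields become grounding material for the Command model.
    return jsonify({"results": search_my_datastore(query)})

if __name__ == "__main__":
    app.run(port=5000)
```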

URL

https://github.com/cohere-ai/quick-start-connectors

Suggested labels

{'label-name': 'data-integration', 'label-description': "Involves integrating Cohere's LLMs with various data sources for retrieval-augmented generation.", 'confidence': 63.24}

irthomasthomas commented 9 months ago

Related issues

553: Cohere LLM API: Retrieval Augmented Generation (RAG)

### Details

Similarity score: 0.88

- [ ] [Retrieval Augmented Generation (RAG)](https://docs.cohere.com/docs/retrieval-augmented-generation-rag)

# Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a method for generating text using additional information fetched from an external data source. Providing relevant documents to the model can greatly increase the accuracy of the response. The Chat API, in combination with the Command model, makes it easy to generate text that is grounded in supplementary information. For example, the code snippet below will produce an answer to "Where do the tallest penguins live?" along with inline citations based on the provided documents.

[More about Retrieval Augmented Generation (RAG)](https://docs.cohere.com/docs/retrieval-augmented-generation-rag)

#### Suggested labels

{'label-name': 'text-generation-method', 'label-description': 'Method for generating text using external information', 'confidence': 54.99}
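The snippet referenced above did not survive extraction. A minimal sketch of the documented pattern, using the Cohere Python SDK, with a placeholder API key and example documents paraphrased from the linked docs:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key, not a real credential

# Ground the response in caller-provided documents; the model cites them inline.
response = co.chat(
    message="Where do the tallest penguins live?",
    documents=[
        {"title": "Tall penguins", "snippet": "Emperor penguins are the tallest, growing up to 122 cm."},
        {"title": "Penguin habitats", "snippet": "Emperor penguins only live in Antarctica."},
    ],
)

print(response.text)       # the grounded answer
print(response.citations)  # spans linking the answer back to the documents
```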

396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

### Details

Similarity score: 0.85

- [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

Astra Assistant API Service
===========================

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling and more, using AstraDB (DataStax's db-as-a-service offering powered by Apache Cassandra and jvector). Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

Getting Started
---------------

1. **Create an Astra DB Vector database**
2. Replace the following code:

```python
client = OpenAI(
    api_key=OPENAI_API_KEY,
)
```

with:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)
```

Or, if you have an existing Astra DB, you can pass your db_id in a second header:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)
```

3. **Create an assistant**

```python
assistant = client.beta.assistants.create(
    instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}]
)
```

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

Third party LLM Support
-----------------------

We now support many third party models for both embeddings and completion thanks to litellm. Pass the api key of your service using `api-key` and `embedding-model` headers.

For AWS Bedrock, you can pass additional custom headers:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)
```

and again, specify the custom model for the assistant:

```python
assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)
```

Additional examples including third party LLMs (bedrock, cohere, perplexity, etc.) can be found under `examples`.

To run the examples using poetry:

1. Create a `.env` file in this directory with your secrets.
2. Run:

```shell
poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py
```

### Coverage

See our coverage report [here](your-coverage-report-link).

### Roadmap

- Support for other embedding models and LLMs
- Function calling
- Pluggable RAG strategies
- Streaming support

#### Suggested labels

{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }

644: cohereai_classify table | CohereAI plugin | Steampipe Hub

### Details

Similarity score: 0.85

- [ ] [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)

# TITLE: cohereai_classify table | CohereAI plugin | Steampipe Hub

**DESCRIPTION:**

```shell
steampipe plugin install mr-destructive/cohereai
```

**Table: cohereai_classify**

Get classifications for given input strings and examples.

Notes:

- `inputs` is a list of strings to classify (max 96 strings).
- `examples` is a list of `{"text": "apple", "label": "fruit"}` structures of type Example.
- A minimum of 2 examples must be provided; the maximum is 2500, with each example at most 512 tokens.

**Examples**

Basic classification with a given set of inputs and examples:

```sql
select classification
from cohereai_classify
where inputs = '["apple", "blue", "pineapple"]'
  and examples = '[{"text": "apple", "label": "fruit"}, {"text": "green", "label": "color"}, {"text": "grapes", "label": "fruit"}, {"text": "purple", "label": "color"}]';
```

Classification with specific settings (model, preset):

```sql
select classification
from cohereai_classify
where settings = '{"model": "embed-multilingual-v2.0"}'
  and inputs = '["Help!", "Call me when you can"]'
  and examples = '[{"text": "Help!", "label": "urgent"}, {"text": "SOS", "label": "urgent"}, {"text": "Call me when you can", "label": "not urgent"}, {"text": "Talk later?", "label": "not urgent"}]';
```

Email spam classification (apostrophes inside the SQL string are escaped by doubling them):

```sql
select classification
from cohereai_classify
where inputs = '["Confirm your email address", "hey i need u to send some $"]'
  and examples = '[{"label": "Spam", "text": "Dermatologists don''t like her!"}, {"label": "Spam", "text": "Hello, open to this?"}, {"label": "Spam", "text": "I need help please wire me $1000 right now"}, {"label": "Spam", "text": "Hot new investment, don''t miss this!"}, {"label": "Spam", "text": "Nice to know you ;)"}, {"label": "Spam", "text": "Please help me?"}, {"label": "Not spam", "text": "Your parcel will be delivered today"}, {"label": "Not spam", "text": "Review changes to our Terms and Conditions"}, {"label": "Not spam", "text": "Weekly sync notes"}, {"label": "Not spam", "text": "Re: Follow up from today''s meeting"}, {"label": "Not spam", "text": "Pre-read for tomorrow"}]';
```

**Schema for cohereai_classify**

| Name           | Type             | Operators | Description                                                                          |
|----------------|------------------|-----------|--------------------------------------------------------------------------------------|
| _ctx           | jsonb            |           | Steampipe context in JSON form, e.g. connection_name.                                |
| classification | text             |           | The classification results for the given input text(s).                              |
| confidence     | double precision |           | The confidence score of the classification.                                          |
| examples       | text             |           | The example text classified.                                                         |
| id             | text             |           | The ID of the classification.                                                        |
| inputs         | text             |           | The input text that was classified.                                                  |
| labels         | jsonb            |           | The labels of the classification.                                                    |
| settings       | jsonb            |           | Settings is a JSONB object that accepts any of the classify API request parameters.  |

**URL:** [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)

#### Suggested labels

315: A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog

### Details

Similarity score: 0.84

- [ ] [A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog](https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b)

A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds. (high-resolution version)

It's the start of a new year and perhaps you're looking to break into the RAG scene by building your very first RAG system. Or, maybe you've built Basic RAG systems and are now looking to enhance them into something more advanced, in order to better handle your users' queries and data structures.

In either case, knowing where or how to begin may be a challenge in and of itself! If that's true, then hopefully this blog post points you in the right direction for your next steps, and moreover, provides a mental model for you to anchor your decisions when building advanced RAG systems.

The RAG cheat sheet shared above was greatly inspired by a recent RAG survey paper ("Retrieval-Augmented Generation for Large Language Models: A Survey", Gao, Yunfan, et al. 2023).

**Basic RAG**

Mainstream RAG as defined today involves retrieving documents from an external knowledge database and passing these, along with the user's query, to an LLM for response generation. In other words, RAG involves a Retrieval component, an External Knowledge database, and a Generation component.

LlamaIndex Basic RAG Recipe:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# load data
documents = SimpleDirectoryReader(input_dir="...").load_data()

# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex.from_documents(documents=documents)

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

# Use your Default RAG
response = query_engine.query("A user's query")
```

#### Suggested labels

{ "key": "RAG-Building", "value": "Techniques and strategies for building advanced Retrieval Augmented Generation systems for language models" }

311: Introduction | Mistral AI Large Language Models

### Details

Similarity score: 0.84

- [ ] [Introduction | Mistral AI Large Language Models](https://docs.mistral.ai/)

Mistral AI currently provides two types of access to Large Language Models:

- An API providing pay-as-you-go access to our latest models,
- Open source models available under the Apache 2.0 License, available on Hugging Face or directly from the documentation.

**Where to start?**

**API Access**

Our API is currently in beta to ramp up the load and provide good quality of service. Access the platform to join the waitlist. Once your subscription is active, you can immediately use our chat endpoint:

```shell
curl --location "https://api.mistral.ai/v1/chat/completions" \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --header "Authorization: Bearer $MISTRAL_API_KEY" \
     --data '{
       "model": "mistral-tiny",
       "messages": [{"role": "user", "content": "Who is the most renowned French painter?"}]
     }'
```

Or our embeddings endpoint:

```shell
curl --location "https://api.mistral.ai/v1/embeddings" \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --header "Authorization: Bearer $MISTRAL_API_KEY" \
     --data '{
       "model": "mistral-embed",
       "input": ["Embed this sentence.", "As well as this one."]
     }'
```

For a full description of the models offered on the API, head on to the model docs. For more examples of how to use our platform, head on to our platform docs.

**Raw model weights**

Raw model weights can be used in several ways:

- For self-deployment, on cloud or on premise, using either TensorRT-LLM or vLLM, head on to Deployment.
- For research, head on to our reference implementation repository.
- For local deployment on consumer grade hardware, check out the llama.cpp project or Ollama.

**Get Help**

Join our Discord community to discuss our models and talk to our engineers. Alternatively, reach out to our business team if you have enterprise needs, want more information about our products, or if there are missing features you would like us to add.

**Contributing**

Mistral AI is committed to open source software development and welcomes external contributions. Please open a PR!

#### Suggested labels

{ "key": "llm-api", "value": "Accessing Large Language Models through the Mistral AI API" }

386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### Details

Similarity score: 0.84

- [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

#### Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

```python
from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional: stream just the "arxiv" subset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break
```

A full set of scripts to recreate the dataset from scratch can be found [here](https://github.com/SciPhi-AI/agent-search/tree/main/scripts). Further, you may check the docs for details on how to perform RAG over AgentSearch.

#### Languages

English.

#### Dataset Structure

The raw dataset structure is as follows:

```json
{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}
```

#### Dataset Creation

This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM-optimal. It was created by filtering, cleaning, and augmenting publicly available datasets.

To cite our work, please use the following:

```bibtex
@software{SciPhi2023AgentSearch,
  author = {SciPhi},
  title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
  year = {2023},
  url = {https://github.com/SciPhi-AI/agent-search}
}
```

#### Source Data

```bibtex
@ONLINE{wikidump,
  author = "Wikimedia Foundation",
  title = "Wikimedia Downloads",
  url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
  title = {OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
  author = {Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
  year = {2023},
  eprint = {2310.06786},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

#### License

Please refer to the licenses of the data subsets you use.

- Open-Web ([Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use/))
- Books: the_pile_books3 license and pg19 license
- ArXiv Terms of Use
- Wikipedia License
- StackExchange license on the Internet Archive

#### Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }