simonw / llm-rag

Answer questions against collections stored in LLM using Retrieval Augmented Generation
Apache License 2.0

Initial plugin design #1

Open simonw opened 7 months ago

simonw commented 7 months ago

The first version of this plugin will add a set of llm rag commands that can run RAG question answering against embedding collections stored in LLM.
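At its core, the command would retrieve similar entries from a collection, pack them into a prompt, and send that to a model. A minimal sketch of the prompt-packing step (build_rag_prompt is an invented helper name, not part of LLM's API):

```python
def build_rag_prompt(question, entries):
    """Combine retrieved (id, content) pairs into a single QA prompt.

    In the real plugin the entries would come from Collection.similar();
    here they are just plain tuples.
    """
    context = "\n\n".join(
        f"[{entry_id}]\n{content}" for entry_id, content in entries
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Including the entry IDs in the context also leaves the door open for citation tricks later, since the model can refer back to them.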

simonw commented 7 months ago

Here's how to run a similarity search against an existing collection in Python:

>>> import llm, sqlite_utils
>>> db = sqlite_utils.Database("/Users/simon/Library/Application Support/io.datasette.llm/embeddings.db")
>>> tweets = llm.Collection("tweets", db)
>>> tweets.similar("angry", 3)
[Entry(id='1285579425', score=0.43602250273712995, content='\ue05a', metadata=None),
 Entry(id='1572249463393259521', score=0.43299700703342975, content='I\'m not sure angry artists will be reassured if presented with this message though:\n\n"No you don\'t understand, your art wasn\'t used to train the model! It was used to train the model that we used to train the model - that\'s completely different!"', metadata=None),
 Entry(id='1529134917166059521', score=0.40047519336364085, content='@philgyford It definitely shows replies from people you follow first, but beyond that I don\'t know what kind of tricks they are pulling\n\nI imagine it\'s "optimized for engagement", which likely means that the most controversial anger-bait frequently rises to the top!', metadata=None)]
simonw commented 7 months ago

Initial command design:

# Answer a question against a specific collection
llm rag "what is shot scraper?" -c blog

I like the idea of being able to set a default collection so you can do this:

llm rag default-collection tweets
llm rag "What happened with Djng?"

And maybe set a default model used for the LLM question answering too:

llm rag default-model gpt-4-turbo

Might want to configure extra things on a per-collection basis:

simonw commented 7 months ago

Open question: should you be able to run RAG against items in multiple collections, provided they used the same embedding model?

I'm inclined to say no for the moment - it feels a bit too complex and wouldn't be used very often.
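If multi-collection support were added later, the same-embedding-model constraint could be enforced with a guard like this (a hypothetical helper, taking advantage of the fact that LLM records the embedding model ID for each collection):

```python
def check_same_embedding_model(collections):
    """Given (name, embedding_model_id) pairs, confirm they share one model.

    Mixing collections is only coherent when every collection was embedded
    with the same model, since vectors from different models are not
    comparable.
    """
    models = {model_id for _name, model_id in collections}
    if len(models) > 1:
        raise ValueError(
            f"Collections use different embedding models: {sorted(models)}"
        )
```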

simonw commented 7 months ago

Idea: if you only have one collection, treat it as the default collection automatically. Or maybe even treat the first collection ever created as the default.
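That resolution order could look something like this sketch (resolve_collection is an invented name; the real rules are still undecided):

```python
def resolve_collection(requested, available):
    """Pick which collection to query.

    Hypothetical resolution order: an explicit -c option always wins;
    otherwise, if exactly one collection exists, it is the default.
    """
    if requested is not None:
        return requested
    if len(available) == 1:
        return available[0]
    raise ValueError("Multiple collections exist - specify one with -c")
```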

simonw commented 7 months ago

Here's a tricky detail: for E5-large-v2 (https://til.simonwillison.net/llms/embed-paragraphs) there is an extra step: the question should be turned into "query: what is LLM?" and the returned content should have the "passage: " prefix stripped.

Not sure how best to handle that.
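One option would be a pair of per-model hooks that wrap the query and unwrap retrieved content - a sketch, with all names invented:

```python
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

def prepare_e5_query(question):
    """E5-style models expect queries prefixed with 'query: '."""
    return E5_QUERY_PREFIX + question

def strip_e5_passage_prefix(content):
    """Stored passages carry a 'passage: ' prefix that should not
    leak into the prompt context."""
    if content.startswith(E5_PASSAGE_PREFIX):
        return content[len(E5_PASSAGE_PREFIX):]
    return content
```

Models that don't need prefixes would simply skip both hooks, so this could hang off the per-collection (or per-configuration) settings.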

simonw commented 7 months ago

Testing it like this, against a copy of my til.db database:

llm embed-multi \
  tils \
  -d embeddings.db \
  --attach content til.db \
  --sql 'select path as id, title, substr(body, 0, 10000) from content.til' \
  -m 3-small \
  --store

Then:

llm rag -d embeddings.db 'what are some good embedding models?' --debug

Output:

Given the context provided, here are some embedding models that have been referenced and utilized in various tasks:

  1. GTR-T5-Large (sentence-transformers/gtr-t5-large)

    • This model is part of the sentence-transformers library and is trained specifically for semantic search tasks, mapping sentences and paragraphs to a 768-dimensional vector space.
  2. E5-Large-V2 (intfloat/e5-large-v2)

    • An embeddings model available in sentence-transformers, which is trained for matching queries against text passages.
  3. GPT-3.5-Turbo-16k (gpt-3.5-turbo-16k)

    • Provided by OpenAI, this model supports a large context window of 16,000 tokens and can be used to generate embeddings and completions for longer chunks of text.
  4. GPT-4-32k-0613 (gpt-4-32k-0613)

    • Another model from OpenAI that provides an even larger context window of 32,000 tokens, allowing for extensive text analysis and generation tasks.
  5. CLIP Model

    • Though not specifically mentioned as an embedding model, the Contextual Language-Image Pretraining (CLIP) model generates embeddings for text and images and can be used to find similarities between visual and textual content.
  6. LLaMA Foundation Language Models (7B, 13B, etc.)

    • Released by Facebook Research, LLaMA models are foundation language models that can be compiled and run on consumer hardware like Apple Silicon MacBooks.
  7. nanoGPT

    • A lightweight GPT-like model by Andrej Karpathy, which can be used for various natural language processing tasks, including generating embeddings.

These models address different scales and types of tasks, from running embedding calculations on laptops for semantic search to leveraging larger GPT models for extensive context windows and generating embeddings for textual and visual content. The choice of the model generally depends on the specific task requirement, the size of the input data, and the computational resources available.

Plus the debug context which was LONG - it's here: https://gist.githubusercontent.com/simonw/c7c13e5425c59abd64a16d16d934eda9/raw/03f9e357914bd8c1a0380ccb7f4e803c1749fc2d/context.txt

simonw commented 7 months ago

On further thought, I think this tool needs to deal with multiple "rag configurations" - essential for trying out different RAG techniques and parameters. The things that can be configured for a RAG configuration will include:

Since this means elevating RAG configurations to a key concept, maybe the llm rag command should require a named configuration, like this:

llm rag tweets "Who is Ben?"

This would mean dropping the "default collection" concept - I think that's OK.

simonw commented 7 months ago

I'll store RAG configurations in YAML files like .../io.datasette.llm/llm_rag/configs/tweets.yml. These can be edited using llm rag configs edit tweets - similar to how llm templates edit summarize works.

I may have a llm rag configs create ... command that provides a CLI way of creating a new one, rather than making people hand-edit YAML to get started.
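A tweets.yml might end up looking something like this - every field name here is a guess, since the configuration schema hasn't been designed yet:

```yaml
# Hypothetical RAG configuration - field names are illustrative only
collection: tweets
database: embeddings.db
model: gpt-4-turbo
num_results: 3
system_prompt: |
  Answer the question using only the provided context.
```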

simonw commented 7 months ago

I started a conversation about RAG citation tricks here: https://twitter.com/simonw/status/1751841193506550112

Those would definitely require the kind of advanced configuration that might need a YAML configuration file.

simonw commented 7 months ago

To help build datasette-rag later on, I should ensure that everything this CLI and its configuration system can do is also available through Python API methods.
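The shape of such an API is entirely undecided, but one way to keep it reusable is to make the entry point a plain function over a config object, with the retrieval and model calls injected - a sketch with invented names throughout:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Hypothetical Python-level mirror of a YAML RAG configuration."""
    collection: str
    model: str = "gpt-4-turbo"
    num_results: int = 3

def rag(question, config, retrieve, prompt_model):
    """Sketch of a Python entry point mirroring 'llm rag NAME QUESTION'.

    retrieve(collection, question, n) returns (id, content) pairs;
    prompt_model(model_id, prompt) returns the model's answer. Injecting
    both keeps the core logic reusable from datasette-rag.
    """
    entries = retrieve(config.collection, question, config.num_results)
    context = "\n\n".join(content for _id, content in entries)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return prompt_model(config.model, prompt)
```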

simonw commented 6 months ago

Potentially relevant paper: https://arxiv.org/abs/2401.14887

Our study fills this gap by thoroughly and critically analyzing the influence of IR components on RAG systems. This paper analyzes which characteristics a retriever should possess for an effective RAG's prompt formulation, focusing on the type of documents that should be retrieved. We evaluate various elements, such as the relevance of the documents to the prompt, their position, and the number included in the context. Our findings reveal, among other insights, that including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality.