simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Support for chunked embeddings #220

Open simonw opened 1 year ago

simonw commented 1 year ago

It would be useful if there was a way to pipe in content to be embedded (both to llm embed and to llm embed-multi) and specify that it should be chunked, using various chunking mechanisms that can be provided by plugins.

simonw commented 1 year ago

First question is how these should be stored. One way could be to add an optional chunk integer column to the embeddings table - that way, when something is chunked, it could be stored as multiple rows.

But what would the ID be? Still the ID of the document? So maybe chunk becomes part of a compound primary key?

More importantly, what if you want to chunk a document in multiple different ways. I could imagine wanting to chunk by both sentence and section (in a .rst file) for example. How should the database differentiate between sentence-chunks and section-chunks for the same document?

One solution would be to use a different collection for them - docs-sentences for one and docs-sections for another. That's not a terrible idea, but is it ergonomic enough?

Eventually I'm going to want to support quite complex mechanisms for RAG (Retrieval Augmented Generation), such as the ability to search for relevant content and then stitch together matching sentences, the 1 or 2 sentences before and after them... but also potentially whole matching sections.

simonw commented 1 year ago

Is there a simple basic starting point for chunking I could implement in a way that would let me add more advanced patterns in the future without breaking anything?

simonw commented 1 year ago

I think all embedding vectors go in the same place: the embeddings table. Whether they are for a whole document, a section or a sentence needs to be represented in that table with enough fidelity that useful things can be done with that information later on.

simonw commented 1 year ago

Maybe the solution is something like this:

The chunking strategy is an interesting thing - it needs to track both the chunking function (which comes from a plugin, e.g. sentence splitting) plus any settings that were passed to that chunking function like the min/max length of a sentence.

Which suggests to me that strategy should be a foreign key against another table - so the embeddings table supports this new mechanism with just a new strategy integer column and a new chunk integer column.

But... now we have a problem with IDs. These are currently strings - if a document has 1000 chunks we will end up repeating the same string ID 1000 times.

So maybe documents are a separate concept with their own table? Then the schema would look like this:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);

CREATE TABLE [strategies] (
   [id] INTEGER PRIMARY KEY,
   [chunk_function] TEXT,
   [settings] TEXT
);

CREATE TABLE [documents] (
   [document_id] INTEGER PRIMARY KEY,
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT
);

CREATE TABLE "embeddings" (
   [document_id] INTEGER REFERENCES [documents]([document_id]),
   [strategy] INTEGER REFERENCES [strategies]([id]),
   [chunk_index] INTEGER,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([document_id], [strategy], [chunk_index])
);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

This is a fair bit more complicated!

Now every item that is embedded (even a one-word string) gets a documents record showing which collection it belongs to, and an embeddings record with its strategy and chunk_index columns set to null (for no chunking strategy).

Once documents start getting chunked, the schema feels more sensible.
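
As a sanity check on the ergonomics, here's a minimal sqlite3 sketch of how a lookup against that schema could work (columns trimmed to the ones that matter here; the collection, document and strategy values are made up for illustration):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE [collections] ([id] INTEGER PRIMARY KEY, [name] TEXT, [model] TEXT);
CREATE TABLE [strategies] ([id] INTEGER PRIMARY KEY, [chunk_function] TEXT, [settings] TEXT);
CREATE TABLE [documents] (
   [document_id] INTEGER PRIMARY KEY,
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT
);
CREATE TABLE [embeddings] (
   [document_id] INTEGER REFERENCES [documents]([document_id]),
   [strategy] INTEGER REFERENCES [strategies]([id]),
   [chunk_index] INTEGER,
   [embedding] BLOB,
   [content] TEXT,
   PRIMARY KEY ([document_id], [strategy], [chunk_index])
);
""")

# Made-up data: one collection, one document, chunked by a "lines" strategy
db.execute("INSERT INTO collections VALUES (1, 'docs', 'ada-002')")
db.execute("INSERT INTO strategies VALUES (1, 'lines', NULL)")
db.execute("INSERT INTO documents VALUES (1, 1, 'python-api.md')")
db.executemany(
    "INSERT INTO embeddings VALUES (1, 1, ?, NULL, ?)",
    [(0, "First line"), (1, "Second line")],
)

# Fetch the chunks for a given collection name + string document ID, in order:
rows = db.execute("""
    SELECT e.chunk_index, e.content
    FROM embeddings e
    JOIN documents d ON d.document_id = e.document_id
    JOIN collections c ON c.id = d.collection_id
    WHERE c.name = ? AND d.id = ? AND e.strategy = ?
    ORDER BY e.chunk_index
""", ("docs", "python-api.md", 1)).fetchall()
print(rows)  # [(0, 'First line'), (1, 'Second line')]

The extra joins are the cost of the integer document_id, but the lookup itself stays simple.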

simonw commented 1 year ago

The alternative to the above would be just adding strategy and chunk_index columns to the existing embeddings table. I'm likely WAY over-thinking the cost of continuing to use a string id column even for documents with thousands of chunks.

simonw commented 1 year ago

Here's that table right now:

CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

I'd need to add chunk_strategy_id and chunk_index columns, but I would also need to update the primary key to incorporate those:

CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [chunk_strategy_id] INTEGER REFERENCES [strategies]([id]),
   [chunk_index] INTEGER,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id], [chunk_strategy_id], [chunk_index])
);

Is it OK to have nullable columns in a compound primary key like that? It's SQLite so it probably works, but would I risk having multiple rows with the same primary key?

Maybe I use 0 and 0 to mark embeddings that are not part of a chunking strategy. Is it OK to do that even though chunk_strategy_id is a foreign key reference and the 0 will not resolve?

simonw commented 1 year ago

Yes, I'm over-thinking this schema. If a user cares that much about the space taken up by those IDs they can themselves use shorter IDs and implement their own lookup tables.

I'm going to do a spike on that second, simpler schema and see what it feels like.

simonw commented 1 year ago

Ran an experiment here: https://chat.openai.com/share/3f53008c-0d45-438b-a801-44ad88990f25

If one of the columns in a compound primary key can contain null, it's possible for two rows to have the same primary key.

I really don't like that.
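
A minimal sketch of that behaviour, using an in-memory table with a trimmed-down version of the compound key above:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE embeddings (
        collection_id INTEGER,
        id TEXT,
        chunk_strategy_id INTEGER,
        chunk_index INTEGER,
        PRIMARY KEY (collection_id, id, chunk_strategy_id, chunk_index)
    )
""")
# Both inserts succeed - NULLs are treated as distinct in the unique index
# backing the primary key, so the "duplicate" un-chunked rows slip through:
db.execute("INSERT INTO embeddings VALUES (1, 'doc-1', NULL, NULL)")
db.execute("INSERT INTO embeddings VALUES (1, 'doc-1', NULL, NULL)")
print(db.execute("SELECT count(*) FROM embeddings").fetchone()[0])  # prints 2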

One solution: the database migrations could populate the strategies table with an initial row with ID 1 which represents "no chunking" - then all of the rows in embeddings could reference that and avoid nulls.
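
A sketch of what that migration step could look like (the 'none' chunk_function name is just a placeholder):

import sqlite3

db = sqlite3.connect("embeddings.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS [strategies] (
   [id] INTEGER PRIMARY KEY,
   [chunk_function] TEXT,
   [settings] TEXT
);
-- Sentinel row with ID 1 meaning "no chunking"; INSERT OR IGNORE keeps the
-- migration idempotent.
INSERT OR IGNORE INTO strategies (id, chunk_function, settings)
VALUES (1, 'none', NULL);
""")
db.commit()

Un-chunked embeddings would then use chunk_strategy_id = 1 and chunk_index = 0, keeping both primary key columns non-null.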

simonw commented 1 year ago

A really simple chunker I could include by default would be lines - it splits on newlines and discards any empty ones.

A chunker function gets fed text and returns an iterator over chunks in that text.

simonw commented 1 year ago

Initial prototype thoughts:

diff --git a/llm/default_plugins/chunkers.py b/llm/default_plugins/chunkers.py
new file mode 100644
index 0000000..23fa750
--- /dev/null
+++ b/llm/default_plugins/chunkers.py
@@ -0,0 +1,13 @@
+from llm import hookimpl
+
+
+def lines(text):
+    "Chunk text into lines"
+    for line in text.split("\n"):
+        if line.strip():
+            yield line
+
+
+@hookimpl
+def register_chunker_functions(register):
+    register(lines, name="lines")
diff --git a/llm/hookspecs.py b/llm/hookspecs.py
index e7f806b..8f179c3 100644
--- a/llm/hookspecs.py
+++ b/llm/hookspecs.py
@@ -18,3 +18,8 @@ def register_models(register):
 @hookspec
 def register_embedding_models(register):
     "Register additional model instances that can be used for embedding"
+
+
+@hookspec
+def register_chunker_functions(register):
+    "Register additional chunker functions, for chunking up text"
diff --git a/llm/plugins.py b/llm/plugins.py
index 230b41c..6f2a0fe 100644
--- a/llm/plugins.py
+++ b/llm/plugins.py
@@ -3,7 +3,7 @@ import pluggy
 import sys
 from . import hookspecs

-DEFAULT_PLUGINS = ("llm.default_plugins.openai_models",)
+DEFAULT_PLUGINS = ("llm.default_plugins.openai_models", "llm.default_plugins.chunkers")

 pm = pluggy.PluginManager("llm")
 pm.add_hookspecs(hookspecs)
simonw commented 1 year ago

If I do get chunking working, the obvious related feature is a search that's "smarter" than the current llm similar command, by being chunk-aware.

I'm not sure what this would actually mean though. Maybe it knows how to search across the different chunk types but use that information to return just single documents scored based on how many of their chunks matched and how strongly?
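
To make that concrete, here is one possible aggregation (a sketch of an assumption, not a settled design), where a document is scored by summing the similarities of its matching chunks:

from collections import defaultdict

def score_documents(chunk_hits):
    # chunk_hits: a list of (document_id, similarity) pairs for matching
    # chunks. A document's score is the sum of its chunk similarities, so
    # documents win by matching many chunks and by matching them strongly.
    scores = defaultdict(float)
    for document_id, similarity in chunk_hits:
        scores[document_id] += similarity
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Example: one document matches twice, another once but more strongly
print(score_documents([
    ("python-api.md", 0.81),
    ("python-api.md", 0.78),
    ("setup.md", 0.90),
]))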

simonw commented 1 year ago

There's a related feature here that I might want to roll into this database schema: the ability to attach embeddings to a document that aren't actually from its content at all.

The two most common cases I've seen here are embedding a summary of the whole document, and embedding one or more synthesized questions that can be answered by the document.

For example:

cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a JSON list'
{
  "questions": [
    "How do I run a prompt using the Python API?",
    "How do I provide a system prompt using the Python API?",
    "How do I use models from plugins with the Python API?",
    "How do I stream responses using the Python API?",
    "How do I use conversations with the Python API?",
    "What utility functions are available in the Python API?",
    "How do I set an alias using the Python API?",
    "How do I remove an alias using the Python API?"
  ]
}

This could also be thought of as a "chunking strategy", where the strategy is to synthesize entirely new invented chunks of the document.
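
A sketch of what that could look like as a hypothetical plugin built on the register_chunker_functions hook from the prototype diff above (the model choice and the prompt wording are assumptions):

import llm
from llm import hookimpl

QUESTIONS_SYSTEM_PROMPT = (
    "Questions that are answered by this document, "
    "as a newline separated list\n"
    "For example:\n\n"
    "How do I install this software?\n"
    "What can this software do?"
)


def synthetic_questions(text):
    "Chunk a document into synthesized questions that it answers"
    # Hypothetical strategy: each generated question becomes its own chunk.
    model = llm.get_model("gpt-3.5-turbo")
    response = model.prompt(text, system=QUESTIONS_SYSTEM_PROMPT)
    for line in response.text().split("\n"):
        if line.strip():
            yield line.strip()


@hookimpl
def register_chunker_functions(register):
    register(synthetic_questions, name="synthetic-questions")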

simonw commented 1 year ago

I had trouble with that prompt. I really want a newline-delimited list of questions, but:

cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a newline separated list'
The Python API for executing prompts in LLM provides the following functionality:

- Basic prompt execution: You can run a prompt against a specific model using the `prompt()` method. The `llm.get_model()` function is used to get the model object. Here is an example:
...

Not sure how to prompt it to actually get back that list. Maybe with an example?

That seemed to work:

cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a newline separated list 
For example:

How do I install this software?
What can this software do?'
How do I execute prompts using the Python API?
How do I use system prompts in the Python API?
How can I use models from plugins in the Python API?
How do I stream responses using the Python API?
How do I use conversations in the Python API?
What other utility functions are provided in the Python API?

These examples all used gpt-3.5-turbo.

phirsch commented 1 year ago

What about storing the actual location/extent of the chunk content region within the document, along with the strategy (in addition to, or even instead of, an index)? That way you could retrieve the content for RAG purposes (after checking that the checksum is still valid).

P.S.: I originally wanted to ask about potential RAG support, before noticing that you are actually planning to implement it anyway - great to hear! (I'd also love to see some PDF or other format pre-processing, e.g. layout-aware content extraction, particularly for scientific papers, but maybe that could be a separate tool, although there might be some relevant interaction with chunking there.)

vividfog commented 1 year ago

> I'm not sure what this would actually mean though. Maybe it knows how to search across the different chunk types but use that information to return just single documents scored based on how many of their chunks matched and how strongly?

I'm not sure I've understood the line of thinking, but in Q&A RAG over lots of files, the top-k chunks alone, assembled from multiple files and put together as an INFO_UPDATE in the prompt, are enough: the LLM just somehow figures out the final answer from that information dump, and generally does a good job if it's a good model.

There may be other use cases, in which the chunks need to be re-assembled into continuous text, but there's plenty of value in treating them as what they are: chunks. Where they came from, that's metadata that can also be used as part of the prompt.

Something like:

Below you receive an information update between INFO_UPDATE_BEGIN and INFO_UPDATE_END tags:

INFO_UPDATE_BEGIN
SOURCE: <filename>
CONTENT: <chunk>
SOURCE: <filename>
CONTENT: <chunk>
INFO_UPDATE_END

Answer the user's question using this information, citing the relevant SOURCE and CONTENT.

USER QUESTION: {question}

That's a rough pattern, a part of the assembled prompt. I have no idea if capitalisation makes any big difference, but that's how I roll it. No harm in trying to be specific.

It's a nice feature if the chunking procedure remembers the line numbers that each chunk covers. That's potential metadata that can be queried at RAG runtime, or even filled into the prompt.
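
For illustration, a rough sketch of that assembly step in Python (the hit structure, with source, content and an optional line range, is a made-up assumption):

def build_rag_prompt(question, hits):
    # hits: a list of dicts with "source", "content" and an optional "lines"
    # (start, end) tuple - a hypothetical structure for this sketch.
    parts = [
        "Below you receive an information update between "
        "INFO_UPDATE_BEGIN and INFO_UPDATE_END tags:",
        "",
        "INFO_UPDATE_BEGIN",
    ]
    for hit in hits:
        source = hit["source"]
        if "lines" in hit:
            source += " (lines {}-{})".format(*hit["lines"])
        parts.append("SOURCE: " + source)
        parts.append("CONTENT: " + hit["content"])
    parts += [
        "INFO_UPDATE_END",
        "",
        "Answer the user's question using this information, "
        "citing the relevant SOURCE and CONTENT.",
        "",
        "USER QUESTION: " + question,
    ]
    return "\n".join(parts)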

stoerr commented 11 months ago

I really like that idea. I was just experimenting with indexing my documents, and even the code in some of my projects, with llm embed-multi + llm similar for use with my developers toolbench ChatGPT plugin, but I found the results difficult to use: you either get just the (hopefully relevant) files, which you'd have to process further (if not using --store), or you get a huge result that swamps ChatGPT, because the stored content is the whole file transmitted in the JSON. Chunking might cut that down and make the results directly usable.

I like your idea of chunking strategies, also from the angle that it would limit the recomputation of embeddings when files change, and that you could have custom strategies that e.g. just pick out the documentation comments and method headers in code. And the idea of having the document preprocessed by the LLM into a short snippet about what types of questions it can answer is Albert Einstein level usage. :-)

One idea. You wrote recently about the "scroll to text" feature in browsers. It would be conceivable to use that in the IDs to identify the chunk: documentation/features/Foobar.md#:~:text=The foo feature is,observing bar. Of course that has some obvious problems, like needing some work and heuristics to be unique and somewhat stable against changes in the document, but it could even be directly usable if you want to display the document in a browser, and it might work without changing the database model at all. (That is, up to reindexing a document: you'd need to purge obsolete snippets, so you'd have to search for IDs pointing into the old version of the document.)
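
For illustration, a rough sketch of how such an ID could be generated (a hypothetical helper - the uniqueness and stability heuristics are exactly what it glosses over):

from urllib.parse import quote

def chunk_fragment_id(path, chunk):
    # Build a "scroll to text" fragment from the first and last few words of
    # the chunk, so the ID doubles as a link into the rendered document.
    words = chunk.split()
    if len(words) <= 8:
        return f"{path}#:~:text={quote(chunk)}"
    start, end = " ".join(words[:4]), " ".join(words[-4:])
    return f"{path}#:~:text={quote(start)},{quote(end)}"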

stoerr commented 11 months ago

One more point: you might want to generalize the "chunking strategy" concept. For the idea of extracting method headers and documentation comments from code as "chunks", one might want to also include information about the class in which each fragment occurs; for a text document, a "chunk" might likewise include information about the document heading and section heading; and your idea of preprocessing the whole document with the LLM could also be viewed as a generalized chunking strategy.

So a "chunking strategy" would become a mapping from a string (possibly including metadata like the file name) to a set of (id, string used for embedding) pairs, where the string used for embedding might or might not actually occur in the original document. If the strategy becomes expensive (e.g. involving an actual LLM call), it would need a mechanism to prevent calling it on an unchanged document, though.

This might be overdoing it, but I think there is merit in doing that within llm itself if you want to efficiently (re-)index a document tree. And generalizing it might just mean having a general enough interface for plugins to implement.
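
A rough sketch of what such a generalized interface might look like (the signature is an assumption, not the prototype's actual hook):

from typing import Iterator, Tuple


def code_outline(text: str) -> Iterator[Tuple[str, str]]:
    "Toy strategy: documentation comments and function headers become chunks"
    for index, line in enumerate(text.split("\n"), start=1):
        stripped = line.strip()
        if stripped.startswith("#") or stripped.startswith("def "):
            # The ID carries the location; the string to embed could just as
            # well be synthesized text that never appears in the document.
            yield (f"line-{index}", stripped)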