simonw opened 1 year ago
First question is how these should be stored. One way could be to add an optional `chunk` integer column to the `embeddings` table - that way when something is chunked it could be stored as multiple rows. But what would the ID be? Still the ID of the document? So maybe `chunk` becomes part of a compound primary key?
More importantly, what if you want to chunk a document in multiple different ways? I could imagine wanting to chunk by both sentence and section (in a `.rst` file) for example. How should the database differentiate between sentence-chunks and section-chunks for the same document?
One solution would be to use a different collection for them - `docs-sentences` for one and `docs-sections` for another. That's not a terrible idea, but is it ergonomic enough?
Eventually I'm going to want to support quite complex mechanisms for RAG (Retrieval Augmented Generation), such as the ability to search for relevant content and then stitch together matching sentences, the 1 or 2 sentences before and after them... but also potentially whole matching sections.
Is there a simple basic starting point for chunking I could implement in a way that would let me add more advanced patterns in the future without breaking anything?
I think all embedding vectors go in the same place, the `embeddings` table. Whether they are for a whole document, a section, or a sentence needs to be represented in that table with enough fidelity that useful things can be done with the information later on.
Maybe the solution is something like this:
- `id` - the ID of the document
- `strategy` - the optional chunking strategy, e.g. paragraphs or sections or sentences
- `chunk` - the index, from 0 upwards, of this specific chunk
- `embedding` ... and the rest of the columns

The chunking strategy is an interesting thing - it needs to track both the chunking function (which comes from a plugin, e.g. sentence splitting) plus any settings that were passed to that chunking function, like the min/max length of a sentence.
Which suggests to me that `strategy` should be a foreign key against another table - so the `embeddings` table supports this new mechanism with just a new `strategy` integer column and a new `chunk` integer column.
But... now we have a problem with IDs. These are currently strings - if a document has 1000 chunks we will end up repeating the same string ID 1000 times.
So maybe `documents` are a separate concept with their own table? Then the schema would look like this:
CREATE TABLE [collections] (
[id] INTEGER PRIMARY KEY,
[name] TEXT,
[model] TEXT
);
CREATE UNIQUE INDEX [idx_collections_name]
ON [collections] ([name]);
CREATE TABLE [strategies] (
[id] INTEGER PRIMARY KEY,
[chunk_function] TEXT,
[settings] TEXT
);
CREATE TABLE [documents] (
[document_id] INTEGER PRIMARY KEY,
[collection_id] INTEGER REFERENCES [collections]([id]),
[id] TEXT
);
CREATE TABLE "embeddings" (
[document_id] INTEGER REFERENCES [documents]([document_id]),
[strategy] INTEGER REFERENCES [strategies]([id]),
[chunk_index] INTEGER,
[embedding] BLOB,
[content] TEXT,
[content_hash] BLOB,
[metadata] TEXT,
[updated] INTEGER,
PRIMARY KEY ([document_id], [strategy], [chunk_index])
);
CREATE INDEX [idx_embeddings_content_hash]
ON [embeddings] ([content_hash]);
This is a fair bit more complicated!
Now every item that is embedded (even a one-word string) gets a `documents` record showing which collection it belongs to, and an `embeddings` record that marks its `strategy` and `chunk_index` columns as `null` (for no chunking strategy).
Once documents start getting chunked, the schema feels more sensible.
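A quick sketch with Python's standard-library `sqlite3` of how one chunked document would land in that schema (columns trimmed to the ones relevant here, and using `chunk_index` in the primary key):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE collections (id INTEGER PRIMARY KEY, name TEXT, model TEXT);
CREATE TABLE strategies (id INTEGER PRIMARY KEY, chunk_function TEXT, settings TEXT);
CREATE TABLE documents (
    document_id INTEGER PRIMARY KEY,
    collection_id INTEGER REFERENCES collections(id),
    id TEXT
);
CREATE TABLE embeddings (
    document_id INTEGER REFERENCES documents(document_id),
    strategy INTEGER REFERENCES strategies(id),
    chunk_index INTEGER,
    content TEXT,
    PRIMARY KEY (document_id, strategy, chunk_index)
);
""")
db.execute("INSERT INTO collections VALUES (1, 'docs', 'ada-002')")
db.execute("INSERT INTO strategies VALUES (1, 'lines', NULL)")
db.execute("INSERT INTO documents VALUES (1, 1, 'python-api.md')")
# One row per chunk - the string document ID is stored just once, in documents
db.execute("INSERT INTO embeddings VALUES (1, 1, 0, 'first line')")
db.execute("INSERT INTO embeddings VALUES (1, 1, 1, 'second line')")
chunks = db.execute(
    "SELECT chunk_index, content FROM embeddings WHERE document_id = 1 ORDER BY chunk_index"
).fetchall()
print(chunks)  # [(0, 'first line'), (1, 'second line')]
```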
The alternative to the above would be just adding `strategy` and `chunk_index` columns to the existing `embeddings` table. I'm likely WAY over-thinking the cost of continuing to use a string `id` column even for documents with thousands of chunks.
Here's that table right now:
CREATE TABLE "embeddings" (
[collection_id] INTEGER REFERENCES [collections]([id]),
[id] TEXT,
[embedding] BLOB,
[content] TEXT,
[content_hash] BLOB,
[metadata] TEXT,
[updated] INTEGER,
PRIMARY KEY ([collection_id], [id])
);
CREATE INDEX [idx_embeddings_content_hash]
ON [embeddings] ([content_hash]);
I'd need to add `chunk_strategy_id` and `chunk_index` columns, but I would also need to update the primary key to incorporate those:
CREATE TABLE "embeddings" (
[collection_id] INTEGER REFERENCES [collections]([id]),
[id] TEXT,
[chunk_strategy_id] INTEGER REFERENCES [strategies]([id]),
[chunk_index] INTEGER,
[embedding] BLOB,
[content] TEXT,
[content_hash] BLOB,
[metadata] TEXT,
[updated] INTEGER,
PRIMARY KEY ([collection_id], [id], [chunk_strategy_id], [chunk_index])
);
Is it OK to have nullable columns in a compound primary key like that? It's SQLite so it probably works, but would I risk having multiple rows with the same primary key?
Maybe I use 0 and 0 to mark embeddings that are not part of a chunking strategy. Is it OK to do that even though `chunk_strategy_id` is a foreign key reference and the 0 will not resolve?
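Worth noting: SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on (it is off by default), so a dangling sentinel `0` would only be rejected under enforcement. A quick check with the standard library:

```python
import sqlite3

SCHEMA = """
CREATE TABLE strategies (id INTEGER PRIMARY KEY);
CREATE TABLE embeddings (id TEXT, chunk_strategy_id INTEGER REFERENCES strategies(id));
"""

# Default connection: foreign keys are NOT enforced, so the dangling 0 is accepted
relaxed = sqlite3.connect(":memory:")
relaxed.executescript(SCHEMA)
relaxed.execute("INSERT INTO embeddings VALUES ('doc', 0)")

# With enforcement turned on, the same insert fails
strict = sqlite3.connect(":memory:")
strict.execute("PRAGMA foreign_keys = ON")
strict.executescript(SCHEMA)
try:
    strict.execute("INSERT INTO embeddings VALUES ('doc', 0)")
    enforced = False
except sqlite3.IntegrityError:
    enforced = True
print(enforced)  # True
```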
Yes, I'm over-thinking this schema. If a user cares that much about the space taken up by those IDs they can themselves use shorter IDs and implement their own lookup tables.
I'm going to do a spike on that second, simpler schema and see what it feels like.
Ran an experiment here: https://chat.openai.com/share/3f53008c-0d45-438b-a801-44ad88990f25
If one of the columns in a compound primary key can contain null, it's possible for two rows to have the same primary key.
I really don't like that.
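The failure mode is easy to reproduce: SQLite allows NULLs in PRIMARY KEY columns of ordinary rowid tables (a long-standing compatibility quirk), and two NULLs never compare equal for uniqueness purposes:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings (id TEXT, chunk_index INTEGER, PRIMARY KEY (id, chunk_index))"
)
db.execute("INSERT INTO embeddings VALUES ('doc-1', NULL)")
db.execute("INSERT INTO embeddings VALUES ('doc-1', NULL)")  # no IntegrityError!
count = db.execute("SELECT count(*) FROM embeddings WHERE id = 'doc-1'").fetchone()[0]
print(count)  # 2 - two rows sharing the "same" primary key
```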
One solution: the database migrations could populate the `strategies` table with an initial row with ID `1` which represents "no chunking" - then all of the rows in `embeddings` could reference that and avoid nulls.
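A sketch of that migration idea, assuming a reserved `strategies` row with ID 1 and `NOT NULL` key columns - with no NULLs in the key, the compound primary key is genuinely unique again:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute(
    "CREATE TABLE strategies (id INTEGER PRIMARY KEY, chunk_function TEXT, settings TEXT)"
)
# The migration seeds a sentinel row meaning "no chunking"
db.execute("INSERT INTO strategies VALUES (1, NULL, NULL)")
db.execute("""CREATE TABLE embeddings (
    id TEXT,
    strategy INTEGER NOT NULL REFERENCES strategies(id),
    chunk_index INTEGER NOT NULL,
    PRIMARY KEY (id, strategy, chunk_index)
)""")
# Un-chunked content uses strategy 1, chunk_index 0
db.execute("INSERT INTO embeddings VALUES ('doc-1', 1, 0)")
try:
    db.execute("INSERT INTO embeddings VALUES ('doc-1', 1, 0)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False - duplicates are now rejected
```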
A really simple chunker I could include by default would be `lines` - it splits on newlines and discards any empty ones.
A chunker function gets fed text and returns an iterator over chunks in that text.
Initial prototype thoughts:
diff --git a/llm/default_plugins/chunkers.py b/llm/default_plugins/chunkers.py
new file mode 100644
index 0000000..23fa750
--- /dev/null
+++ b/llm/default_plugins/chunkers.py
@@ -0,0 +1,13 @@
+from llm import hookimpl
+
+
+def lines(text):
+ "Chunk text into lines"
+ for line in text.split("\n"):
+ if line.strip():
+ yield line
+
+
+@hookimpl
+def register_chunker_functions(register):
+ register(lines, name="lines")
diff --git a/llm/hookspecs.py b/llm/hookspecs.py
index e7f806b..8f179c3 100644
--- a/llm/hookspecs.py
+++ b/llm/hookspecs.py
@@ -18,3 +18,8 @@ def register_models(register):
@hookspec
def register_embedding_models(register):
"Register additional model instances that can be used for embedding"
+
+
+@hookspec
+def register_chunker_functions(register):
+ "Register additional chunker functions, for chunking up text"
diff --git a/llm/plugins.py b/llm/plugins.py
index 230b41c..6f2a0fe 100644
--- a/llm/plugins.py
+++ b/llm/plugins.py
@@ -3,7 +3,7 @@ import pluggy
import sys
from . import hookspecs
-DEFAULT_PLUGINS = ("llm.default_plugins.openai_models",)
+DEFAULT_PLUGINS = ("llm.default_plugins.openai_models", "llm.default_plugins.chunkers")
pm = pluggy.PluginManager("llm")
pm.add_hookspecs(hookspecs)
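Pulling the `lines` chunker out of that diff, it can be exercised standalone:

```python
def lines(text):
    "Chunk text into lines, discarding empty or whitespace-only ones"
    for line in text.split("\n"):
        if line.strip():
            yield line


chunks = list(lines("First line\n\n  \nSecond line\nThird line\n"))
print(chunks)  # ['First line', 'Second line', 'Third line']
```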
If I do get chunking working, the obvious related feature is a search that's "smarter" than the current `llm similar` command, by being chunk-aware.
I'm not sure what this would actually mean though. Maybe it knows how to search across the different chunk types but use that information to return just single documents scored based on how many of their chunks matched and how strongly?
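One naive version of that scoring could sum each document's chunk similarities and rank by the total - sketched here with invented names (`score_documents` is not part of llm):

```python
from collections import defaultdict


def score_documents(chunk_hits):
    """chunk_hits: (document_id, similarity) pairs for the matching chunks.
    Returns documents ranked by the combined strength of their chunk matches."""
    scores = defaultdict(float)
    for doc_id, similarity in chunk_hits:
        scores[doc_id] += similarity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = score_documents([("python-api.md", 0.91), ("setup.md", 0.85), ("python-api.md", 0.72)])
print(ranked[0][0])  # python-api.md wins - two of its chunks matched
```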
There's a related feature here that I might want to roll into this database schema: the ability to attach embeddings to a document that aren't actually from its content at all.
The two most common cases I've seen here are embedding a summary of the whole document, and embedding one or more synthesized questions that can be answered by the document.
For example:
cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a JSON list'
{
"questions": [
"How do I run a prompt using the Python API?",
"How do I provide a system prompt using the Python API?",
"How do I use models from plugins with the Python API?",
"How do I stream responses using the Python API?",
"How do I use conversations with the Python API?",
"What utility functions are available in the Python API?",
"How do I set an alias using the Python API?",
"How do I remove an alias using the Python API?"
]
}
This could also be thought of as a "chunking strategy", where the strategy is to synthesize entirely new invented chunks of the document.
I had trouble with that prompt. I really want a newline-delimited list of questions, but:
cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a newline separated list'
The Python API for executing prompts in LLM provides the following functionality:
- Basic prompt execution: You can run a prompt against a specific model using the `prompt()` method. The `llm.get_model()` function is used to get the model object. Here is an example:
...
Not sure how to prompt it to actually get back that list. Maybe with an example?
That seemed to work:
cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a newline separated list
For example:
How do I install this software?
What can this software do?'
How do I execute prompts using the Python API?
How do I use system prompts in the Python API?
How can I use models from plugins in the Python API?
How do I stream responses using the Python API?
How do I use conversations in the Python API?
What other utility functions are provided in the Python API?
These examples all used `gpt-3.5-turbo`.
What about storing the actual location/extent of the chunk content region within the document along with the strategy (in addition to or even instead of an index)? That way, you could retrieve the content for RAG purposes (after checking the checksum is still valid).
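A chunker that records extents could yield `(start, end, text)` spans instead of bare strings - a hypothetical sketch built on the line-splitting idea:

```python
def line_spans(text):
    "Yield (start, end, chunk) character offsets for each non-empty line"
    position = 0
    for line in text.split("\n"):
        if line.strip():
            yield (position, position + len(line), line)
        position += len(line) + 1  # +1 for the newline that split() removed


text = "alpha\n\nbeta"
spans = list(line_spans(text))
print(spans)  # [(0, 5, 'alpha'), (7, 11, 'beta')]
```

Storing `start` and `end` alongside the strategy would let a RAG pipeline slice the original document back out, after verifying the content hash still matches.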
P.S.: Originally wanted to ask about potential RAG support before I noticed that you are actually planning to implement this anyway - great to hear! (Would love to see some PDF or other format pre-processing as well, e.g. for layout-aware content extraction, particularly for scientific papers, but maybe that could be separate tools, although there might be some relevant interaction with chunking there.)
> I'm not sure what this would actually mean though. Maybe it knows how to search across the different chunk types but use that information to return just single documents scored based on how many of their chunks matched and how strongly?
I'm not sure if I understood the line of thinking, but in Q&A RAG over lots of files, the top-k chunks assembled from multiple files, when put together as an INFO_UPDATE in the prompt, are usually enough ... the LLM just somehow figures out the final answer based on that information dump. And it generally does a good job if it's a good model.
There may be other use cases, in which the chunks need to be re-assembled into continuous text, but there's plenty of value in treating them as what they are: chunks. Where they came from, that's metadata that can also be used as part of the prompt.
Something like:
Below you receive an information update between INFO_UPDATE_BEGIN and INFO_UPDATE_END tags:
INFO_UPDATE_BEGIN
SOURCE: <filename>
CONTENT: <chunk>
SOURCE: <filename>
CONTENT: <chunk>
INFO_UPDATE_END
Answer the user's question using this information, citing the relevant SOURCE and CONTENT.
USER QUESTION: {question}
That's a rough pattern, a part of the assembled prompt. I have no idea if capitalisation makes any big difference, but that's how I roll it. No harm in trying to be specific.
It is a nice feature, if the chunk procedure remembers the line numbers which this chunk is a part of. That's potential metadata to be queried at RAG runtime, even filled into the prompt.
I really like that idea. I was just experimenting with indexing my documents, and even the code in some of my projects, with `llm multi-embed` + `llm similar` for use with my developers toolbench ChatGPT plugin, but found that the results are difficult to use: you either get just the (hopefully relevant) files, which you'd have to process further (if not using `--store`), or you get a mother huge result that'd swamp ChatGPT because the stored content is just the whole file transmitted in the JSON. Chunking might cut that down and make it directly usable. I like your idea with chunking strategies, also from the angle that it'd limit the recomputation of embeddings when files are changed, and you could have custom strategies that e.g. just pick out the documentation comments and method headers in code. And that idea of having the document preprocessed by the LLM into a short snippet about what types of questions the document can answer is Albert Einstein level usage. :-)
One idea. You wrote recently about the "scroll to text" feature in browsers. It'd be conceivable to use that in the IDs to identify the chunk: `documentation/features/Foobar.md#:~:text=The foo feature is,observing bar`. Of course that has some obvious problems, like needing some work and heuristics to be unique and somewhat stable against changes in the document, but it could even be directly usable if you want to display the document in a browser, and might even work without changing the database model at all. (That is, up to reindexing a document: you'd need to purge obsolete snippets, so you have to search for IDs pointing into the old version of the document.)
One more point. You might want to generalize the "chunking strategy" concept. For one, for that idea of extracting method headers and documentation comments from code as "chunks", one might want to include information about the class in which that fragment occurs, too; or for a text document, a "chunk" might also include information about the document heading and section heading; and your idea of preprocessing the whole document with the LLM could also be viewed as a generalized chunking strategy. So the "chunking strategy" would become a mapping of a string (possibly including metadata like file name) into a set of (id and string used for embedding), where that string used for embedding might or might not actually occur in the original document. If the strategy becomes expensive (like involving an actual LLM call), that'd need a mechanism to prevent calling it on an unchanged document, though.
This might be overdoing it, but I think there is a merit of doing that within llm itself if you want to efficiently (re-)index a document tree. And generalizing it might just mean having a general enough interface to be implemented by plugins.
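That generalized shape - document in, `(chunk_id, embeddable_text)` pairs out - might look like this hypothetical sketch, where the nearest heading is folded into each chunk's embeddable text even though that combined string never appears verbatim in the document:

```python
def chunks_with_headings(text):
    """Map a markdown document to (chunk_id, embeddable_text) pairs,
    prefixing each chunk with its nearest heading for extra context."""
    heading = None
    for number, line in enumerate(text.split("\n")):
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
        elif line.strip():
            chunk_id = f"line-{number}"
            embeddable = f"{heading}: {line}" if heading else line
            yield (chunk_id, embeddable)


doc = "# Setup\npip install llm\n# Usage\nllm 'hello'"
pairs = list(chunks_with_headings(doc))
print(pairs)
# [('line-1', 'Setup: pip install llm'), ('line-3', "Usage: llm 'hello'")]
```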
It would be useful if there was a way to pipe in content to be embedded (both to `llm embed` and to `llm embed-multi`) and specify that it should be chunked, using various chunking mechanisms that can be provided by plugins.