nickthecook / archyve

GNU Affero General Public License v3.0

Idea: simplify your system architecture by dropping chromadb - using PG exclusively #85

Closed MadBomber closed 1 month ago

MadBomber commented 1 month ago

PostgreSQL is a workhorse. I don't think you really need a separate vector database. The pgvector extension with the neighbor gem is a good enough solution. Keeping the entire system within the context of an ROR application would be my goal.
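As a rough illustration of what pgvector plus the neighbor gem provide, here is a plain-Ruby sketch of cosine-distance nearest-neighbor search (names and data shapes are hypothetical; with pgvector the comparison happens inside Postgres):

```ruby
# Plain-Ruby sketch of the nearest-neighbor search that pgvector
# performs inside Postgres. Record shapes here are made up.

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  dot / mag
end

# Return the ids of the k chunks whose embeddings are closest to the query.
def nearest_chunks(query_embedding, chunks, k: 3)
  chunks.min_by(k) { |c| 1 - cosine_similarity(query_embedding, c[:embedding]) }
        .map { |c| c[:id] }
end
```

With the neighbor gem, the equivalent in a Rails model is a `has_neighbors :embedding` declaration plus a query like `Chunk.nearest_neighbors(:embedding, query_embedding, distance: "cosine").first(3)`, executed by Postgres itself.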

btw: I like the command-line utility tbls (brew installable) for documenting the database schema. I also like using the comment clause on each table and each column so that we don't forget what that table/column was originally intended for. With team members coming and going over time, data objects sometimes get used incorrectly without good documentation of the schema.
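Those comment clauses fit naturally into a Rails migration, since Rails passes the `comment:` option through to PostgreSQL's `COMMENT ON`, where tools like tbls pick it up. A hedged sketch (table and column names are hypothetical):

```ruby
# Hypothetical migration sketch showing schema comments.
# Rails forwards `comment:` to PostgreSQL, so tbls can render them.
class CreateChunks < ActiveRecord::Migration[7.1]
  def change
    create_table :chunks, comment: "A unit of document text sized for embedding" do |t|
      t.references :document, null: false, foreign_key: true
      t.text :content, comment: "Raw text returned to the LLM as an augmentation"
      t.timestamps
    end
  end
end
```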

back to pg extensions ... I recently ran across the pgai extension, which seems more Python-oriented but is worth a look. I believe it actually incorporates pgvector along with some extra functions.

I like the idea I see in the docs of creating summaries and generating embeddings on the summaries as well as on the document chunks. I believe that may mitigate some of the problems that arise when multiple authors with different writing styles appear within and across a document collection.

In terms of the database schema, it might be interesting to look at adding an FAQ table. Whether this lives within the context of a collection or across the board is TBD. The idea is that either the user or the system creates an FAQ entry so that, on a query, the FAQs semantically close to the query get added to the query response, maybe as a "see also" kind of link if not specifically included within the response to the user.
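A minimal plain-Ruby sketch of that "see also" matching (the threshold, record shape, and output format are all made up; in practice the similarity comparison would run in Postgres via pgvector):

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  dot / mag
end

# Pick out FAQ entries semantically close to the query and format them
# as "see also" links for the response. The threshold is a made-up value.
def see_also_faqs(query_embedding, faqs, threshold: 0.8)
  faqs.select { |f| cosine_similarity(query_embedding, f[:embedding]) >= threshold }
      .map { |f| "See also: #{f[:question]}" }
end
```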

Another possible improvement is looking at generating potential question/answer pairs based upon the document chunks as they are being processed. This might actually tie in well with the knowledge graph that is being generated.

Thanks for sharing your project. It's going to be lots of fun to play with.

Dewayne o-*

nickthecook commented 1 month ago

Some good ideas, thanks for the input.

Dropping ChromaDB in favour of pg_vector would simplify things. That said, it's not a high priority, since ChromaDB is implemented and has been no trouble so far. One would also need to handle the migration from ChromaDB to pg_vector, which is a bit of extra work. Worth trying someday, though.

The FAQ thing you mentioned sounds like a "Facts" feature I was thinking of adding. I'd like to be able to just enter simple chunks of text via the UI that would be included in responses where they're relevant. I was thinking independent of Collections, or maybe in one "special" Collection, but optional on any given query. Then I'd like to be able to say, in a prompt, "fact: the name of that character I can never remember from Lost is 'Desmond'", or "fact: the machine in my basement running HomeAssistant is a Raspberry Pi 4" and have it remember that.
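A tiny sketch of how those "fact:" statements could be pulled out of a prompt before it goes to the model (the prefix syntax and function name are hypothetical, not an existing Archyve feature):

```ruby
# Hypothetical sketch: extract "fact: ..." statements from a prompt so
# they can be stored as Facts independent of any Collection.
FACT_PREFIX = /\Afact:\s*/i

def extract_facts(prompt)
  prompt.lines.map(&:strip)
        .select { |line| line.match?(FACT_PREFIX) }
        .map { |line| line.sub(FACT_PREFIX, "") }
end
```

Each extracted string would then be embedded and stored like any other Chunk, so semantic search surfaces it on relevant queries.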

In general, I think Chunks should be able to come from many sources, not just Documents, and maybe Facts and/or FAQs are two of those sources. Web scraping could also be a source. Semantic search will continue to return relevant Chunks, not caring where they come from.

oxaronick commented 1 month ago

Another possible improvement is looking at generating potential question/answer pairs based upon the document chunks as they are being processed. This might actually tie in well with the knowledge graph that is being generated.

Interested in hearing more about this.

Doing similarity search for a question among a bunch of vectorized answers seems pretty effective, but it would be even better to search among a list of questions. A chunk could be embedded with a question in the embedding_content field (which is what gets put in the vector DB), while the answer to that question lives in the content field (which is what gets returned as an augmentation).
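A plain-Ruby sketch of that split: similarity search runs over embedding_content (the generated question), but the text handed back as an augmentation is content (the answer). The in-memory search is a stand-in for the vector DB, and the example data is invented:

```ruby
# embedding_content is what gets vectorized; content is what gets returned.
Chunk = Struct.new(:embedding_content, :content, :embedding, keyword_init: true)

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  dot / mag
end

# Search against the embedded question, but hand back the answer text.
def best_augmentation(query_embedding, chunks)
  chunks.max_by { |c| cosine_similarity(query_embedding, c.embedding) }.content
end
```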

MadBomber commented 1 month ago

@oxaronick this is one of the reasons why I like the use of PostgreSQL in AI applications. You can take advantage of the semantic search capabilities of the embeddings vector along with faster search capabilities on other columns or table associations. For example, you can have a separate embeddings table which belongs to a chunk content table, which belongs to a documents table. You can save complete FAQs in another table, along with a generated-questions table owned by the documents and/or chunks tables. You don't necessarily have to hit that embeddings table every time to satisfy a common question.
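A toy illustration of that layering, assuming the tables above (the hash and lambda are stand-ins for a Postgres FAQ table and the embeddings search): try a cheap exact-match FAQ lookup first and only fall back to the vector search on a miss.

```ruby
# Answer from a plain relational FAQ lookup when possible, falling back
# to the slower embeddings search otherwise. Names here are made up;
# faq_table stands in for a Postgres table keyed on a normalized question.
def answer(query, faq_table, embeddings_search)
  faq_table.fetch(query.downcase.strip) { embeddings_search.call(query) }
end
```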

oxaronick commented 1 month ago

I looked into pgai a bit, and I don't think I'll use that. If I wasn't building an app that integrated with inference servers I might want to use it. For this app, though, it just seems like it would be pushing a major concern onto an existing system component out of convenience, resulting in a much more complex architecture.

I also looked at pgvector, and it seems a little more complex to implement than ChromaDB, but I can see potential advantages there. I'll keep my eyes open for concrete advantages, and if I can find one I'll look at switching.

In the meantime, Postgres is an excellent SQL database, and that's why I'm using it!