hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

Can we have vector database along with tags? #267

Open echo-saurav opened 1 month ago

echo-saurav commented 1 month ago

Automatically getting tags with ollama is quite nice! But I think it would be even more awesome if it also stored text vectors, so we could search by similar text, or filter to group similar links / images together.
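To make "search by similar text" concrete, here is a minimal Python sketch of the idea. It assumes every bookmark already has an embedding vector; the bookmark names and the tiny 3-dimensional vectors below are made up for illustration (real embeddings would come from a model and have hundreds of dimensions). It ranks bookmarks by cosine similarity with a brute-force scan, which is not necessarily how hoarder would implement it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, embeddings, top_k=2):
    """Return the top_k (name, score) pairs, best match first, via a brute-force scan."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-written embeddings standing in for model output.
EMBEDDINGS = {
    "sqlite-fts-guide": [0.9, 0.1, 0.0],
    "postgres-tuning": [0.8, 0.3, 0.1],
    "sourdough-recipe": [0.0, 0.1, 0.9],
}

# The query vector points in the "database" direction, so the two
# database bookmarks should rank above the recipe.
results = most_similar([1.0, 0.2, 0.0], EMBEDDINGS)
```

A linear scan like this is fine for a few thousand bookmarks; a dedicated vector index mostly matters once the collection gets much larger.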

MohamedBassem commented 1 month ago

This is definitely planned and I actually have a prototype for it already :) Just trying to find a reasonable vector db without adding extra dependencies.

echo-saurav commented 1 month ago

Awesome! You could look at weaviate. It handles both text and image vectors (I know it's not really lightweight, but I like it a lot because of how customisable it is).

ieaves commented 1 month ago

You can actually do vector search in postgres pretty easily. Postgres would also have the side benefit of not requiring the database to be mounted into every container.

MohamedBassem commented 1 month ago

The problem is that hoarder currently doesn't depend on postgres. So introducing postgres now as a dependency will be very disruptive. If I'm to start hoarder from scratch, I'd have gone for postgres for everything (database, FTS, vector search, etc). But it's too late now unfortunately.

ieaves commented 1 month ago

Ahh okay. I'm not familiar with Drizzle, but perusing the docs made it look like a fairly simple drop-in replacement.

MohamedBassem commented 1 month ago

It's less about the code changes and more about asking every existing user to add a new dependency and migrate their data.

kamtschatka commented 1 month ago

I actually REALLY like the idea of moving to postgres.

Yes, I understand that it is disruptive, but the release notes already have a section on what to keep an eye on. If we update the UI to show that you need to add new environment variables for a postgres db, and we offer an automatic data migration (this could be version 0.16.0 with not much else in it), we could keep the disruption low AND open up a whole lot of possibilities for us in the future.

MohamedBassem commented 1 month ago
  • It is a toy db anyways.

I very much disagree that sqlite is a toy database. Cloudflare's D1 database, for example, is built on top of sqlite. Other companies like fly.io and turso are also offering prod databases built on top of sqlite. Tailscale has also embraced sqlite in prod. We're very far from approaching the limits of sqlite. It also fits us well because we don't need a client/server architecture, given that our deployments are usually on a single machine.

We can get rid of Meilisearch:

Sqlite has full text search too, btw (https://www.sqlite.org/fts5.html), and the extension is already enabled in our docker containers. I haven't given it a try, so I don't know how good it is compared to Meilisearch's. I haven't tried postgres' FTS either. So if getting rid of Meilisearch is a goal, there's a route to do it on sqlite as well.

Two limitations that I know about in sqlite's FTS (that I don't know if pg handles better):

  1. I remember searching for good packages in npm to interact with sqlite's FTS, but didn't find decent libraries.
  2. Sqlite's FTS doesn't support fuzzy search, at least natively.
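For a feel of what sqlite's FTS looks like in practice, here is a small sketch using Python's built-in `sqlite3` module. It assumes the SQLite build behind your Python has FTS5 compiled in (common, but not guaranteed), and the table and column names are hypothetical, not hoarder's schema. It also shows the second limitation: prefix queries work, but a misspelled token matches nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 virtual table; "bookmarks_fts", "title" and "content" are made-up names.
conn.execute("CREATE VIRTUAL TABLE bookmarks_fts USING fts5(title, content)")
conn.executemany(
    "INSERT INTO bookmarks_fts (title, content) VALUES (?, ?)",
    [
        ("SQLite FTS5", "Full text search extension for sqlite"),
        ("Meilisearch docs", "A search engine with typo tolerance"),
        ("Bread", "How to bake sourdough at home"),
    ],
)


def search(query):
    """Return the titles of matching rows, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT title FROM bookmarks_fts WHERE bookmarks_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [title for (title,) in rows]


exact = search("sourdough")   # exact token match
prefix = search("sourd*")     # prefix queries are supported
typo = search("sourdoug")     # misspelled token: no fuzzy matching, so no hits
```

Since this only needs plain SQL, the lack of a dedicated npm package may matter less than it first seems; any sqlite driver that can run raw queries could issue the same statements.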

It is currently juggling a lot of stuff around in memory

This can be solved if we move to sqlite's FTS.

For the database you no longer need to have the same location mounted on both apps

I've actually been thinking about going the route that immich went: just merge the workers and web containers into one. That'll simplify the deployment a bit without sacrificing anything. I initially went with a separate container for the worker because the worker was the one spawning the chrome process and I didn't want that mixed into the web container. But now chrome is in its own container, and we can probably just spin up the workers as a background job inside the web container.

Yes, I understand that it is disruptive

This "IS" my biggest concern. It is very disruptive, and we will lose some users because of that move. I, for example, am still stuck on old immich releases because I don't have time to go through all the recent breaking changes that they introduced. I want hoarder to just work for people, and regardless of how many bells we add to the UI, we're going to break some deployments with this migration, and I don't really want this to happen.

I understand that sometimes this is a cost we'll have to pay, but so far, I'm not seeing the strong justification to pay it just yet.

Another Route

There's another route we can take though. We can double down on sqlite:

  1. Merge web and workers container.
  2. Migrate away from Meilisearch to sqlite's FTS (if it's good enough).
  3. Migrate away from bullmq to a queue built on top of sqlite. We're not a high-QPS service anyway, so it shouldn't be that hard.
  4. There's a WIP sqlite vector search extension (https://github.com/asg017/sqlite-vec) that I've been keeping an eye on, and it seems to have recently picked up sponsors from mozilla, turso, fly, etc. We can adopt this for our vector database as well once it's mature.
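To illustrate step 3, a queue on top of sqlite can be as small as one table plus an atomic "claim" step. The sketch below uses Python's built-in `sqlite3` module; the `jobs` table, its columns, the status values and the payload strings are all hypothetical and not hoarder's or bullmq's design. Retries, timeouts and failure handling are left out.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so
# transactions are controlled explicitly with BEGIN / COMMIT below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    """
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'  -- pending | running | done
    )
    """
)


def enqueue(payload):
    """Add a job and return its id."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next():
    """Mark the oldest pending job as running and return (id, payload), or None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers sharing
    # the database file cannot both claim the same job.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def complete(job_id):
    """Mark a claimed job as finished."""
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


first = enqueue("crawl:https://example.com")
second = enqueue("tag:bookmark-42")
claimed = claim_next()        # oldest pending job
complete(claimed[0])
next_claimed = claim_next()   # the second job, since the first is done
```

For a low-QPS, single-machine deployment, polling `claim_next()` every second or so from a background worker would likely be enough, which is part of why this route avoids a separate queue dependency.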