simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Save embeddings in existing db? #318

Open tf13 opened 10 months ago

tf13 commented 10 months ago

Curious if it's possible to save embeddings in an existing db, rather than a dedicated SQLite database.

Use-case: I'll be running a semantic search app in a cloud container. Redeploying will replace any updates to the db. It would be great to be able to pull existing embeddings in from a remote database, especially if it can be the one I'm already using for another element of the project.

simonw commented 10 months ago

> Curious if it's possible to save embeddings in an existing db, rather than a dedicated SQLite database.

It's not, but one of the reasons I'm building everything around SQLite is it makes it very easy to use other tools in that ecosystem (like sqlite-utils) to further export and manipulate the data.
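For example, assuming the default schema (an `embeddings` table with vectors stored as packed little-endian float32 blobs; that layout is an assumption about the current internals), something like this sketch pulls them back out with sqlite-utils:

```python
import struct
import sqlite_utils

# Open LLM's embeddings database; the path here is an assumption,
# adjust it to wherever yours lives.
db = sqlite_utils.Database("embeddings.db")

def decode(blob):
    # Assumes vectors are stored as packed little-endian float32 values
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)

for row in db["embeddings"].rows:
    vector = decode(row["embedding"])
    print(row["collection_id"], row["id"], len(vector))
```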

That said... getting the embeddings back out of LLM isn't nearly easy enough yet!

What would your ideal workflow look like here? If LLM has eg calculated embeddings for 500 files (or other inputs) what would you then like to be able to do with them?

tf13 commented 10 months ago

It's hard to say, since I'm just getting started implementing semantic search (but I have big plans...). I think I'd want to be able to process a bunch of texts, save the embeddings, and then put them somewhere (a database) of my own choosing, from which I could retrieve them for comparison when searching.

For my own use, a local SQLite database is fine. But for one project I'm working on, the database needs to be available from a remote (dockerized) app. It would be much better to just make a call to the remote Postgres database with SQLAlchemy etc. than to have to push the SQLite database each time I deploy the app, or redeploy the app each time I want to use updated data.

Not sure if that's specific (or clear) enough.
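To make it more concrete, the kind of workflow I have in mind is roughly this sketch (the connection string, table and column names are hypothetical, and it assumes llm stores its vectors as packed little-endian float32 blobs):

```python
import struct
import sqlite_utils
from sqlalchemy import create_engine, text

# Hypothetical connection string -- replace with your own Postgres details
engine = create_engine("postgresql+psycopg2://user:pass@host/dbname")

# LLM's local embeddings database (path is an assumption)
src = sqlite_utils.Database("embeddings.db")

def decode(blob):
    # Assumes packed little-endian float32 vectors
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))

with engine.begin() as conn:
    # Hypothetical target table using a Postgres float array column
    conn.execute(text(
        "CREATE TABLE IF NOT EXISTS doc_embeddings ("
        "id TEXT PRIMARY KEY, vector REAL[])"
    ))
    for row in src["embeddings"].rows:
        conn.execute(
            text("INSERT INTO doc_embeddings (id, vector) VALUES (:id, :vec) "
                 "ON CONFLICT (id) DO UPDATE SET vector = :vec"),
            {"id": row["id"], "vec": decode(row["embedding"])},
        )
```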

Analect commented 2 months ago

@simonw ... inspired by your talk, which Hugo at metaflow brought to my attention, I wanted to add a few thoughts here. Rather than creating a new issue, I thought I'd add to this existing related one. Hope you don't mind, @tf13.

Your talk touched on your use of SQLite and the brute-force scanning required for semantic search. I've started working more with LanceDB recently, which is part of a new class of open-source embedded databases that seem well suited to the kind of exploratory workflows your llm command-line tool enables. A recent blog post of theirs offers a good summary of why their innovations are useful at the intersection of LLMs and data, including versioning, zero-copy schema evolution, and multi-modal storage.
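For context, the basic LanceDB workflow looks roughly like this (a sketch: the table name and vectors are made up, and in practice you would insert vectors exported from llm):

```python
import lancedb

# Embedded database: just a directory on disk, no server process
db = lancedb.connect("./lance-data")

# Made-up example rows; real usage would insert vectors exported from llm
table = db.create_table("docs", data=[
    {"id": "readme", "vector": [0.1, 0.2, 0.3], "text": "hello"},
    {"id": "notes", "vector": [0.2, 0.1, 0.5], "text": "world"},
])

# Vector similarity search against the stored table
results = table.search([0.1, 0.2, 0.4]).limit(2).to_list()
```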

> What would your ideal workflow look like here? If LLM has eg calculated embeddings for 500 files (or other inputs) what would you then like to be able to do with them?

One nice aspect of these embedded databases is that you don't need a server to run them. For a use case where you are collaborating with others, locating the db file on s3express is sufficient, with very performant latency.
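With object storage it's the same API; you just point the connection at a remote URI (the bucket name here is hypothetical, and credentials come from the usual AWS environment variables):

```python
import lancedb

# Same embedded API, but the table data lives in object storage
db = lancedb.connect("s3://my-bucket/lance-data")
table = db.open_table("docs")
results = table.search([0.1, 0.2, 0.4]).limit(2).to_list()
```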

Having the ability to point the llm command-line tool at pre-existing remote storage and resume experimentation would be great. Is this something that could be envisaged?