neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Allow save embeddings (the tmpfile.npy); or option to return embeddings with query #395

Closed - lefnire closed this issue 1 year ago

lefnire commented 1 year ago

I'd like to work with the actual numpy embeddings (the ones buffered here); context on why later. I thought storevectors did this, but that appears to save the transformers model, and the embeddings file is the ANN index. I see two solutions. The first: add a vector field to the SQL search query (e.g. select id, text, vector from txtai), which would run faiss reconstruct() here. I don't know how reconstruct works internally, but I see it used for this purpose in haystack. My concern is that this solution is lossy (does reconstruct try to regenerate the original vector?). You'd also need an equivalent for the other ANN backends. So I think a better approach is:

Add another config option like save_actual_embeddings, which would copy over the tmpfile.npy. But it would be hard to map the original IDs back in - unless you did something dirty like making the first column of the numpy array the id (text) while the other columns are floats. A better approach would be to save it as Pandas (or Feather / Arrow / whatever), with an id column and a vector column. I think this could be a simple add. You could take it even further and scrap sqlite, just using the Pandas dataframe (wherein you can do SQL-like queries, fields, etc.) - two birds.
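
For what it's worth, a minimal faiss sketch of what reconstruct does (my understanding: for a flat index it returns the stored vector exactly, while quantized indexes like IVFPQ return a lossy approximation; the dimension and data below are arbitrary):

```python
import numpy as np
import faiss

# Illustrative only: a small 384-dim flat index with random vectors
index = faiss.IndexFlatIP(384)
index.add(np.random.rand(10, 384).astype("float32"))

# reconstruct() returns the vector as stored in the index:
# exact for a flat index, a decoded approximation for quantized indexes (e.g. IVFPQ)
vector = index.reconstruct(0)
```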


The context (feel free to skip): I'm building search. When users search, they get their results like normal, but the sidebar also shows a bunch of resources related to those results. It takes the vectors of all the returned results, np.mean()s them, then sends that mean to the various resource searches. Specifically it's a journal app, and a user will search through past entries. The result of that search might be, say, 10 entries. np.mean(10_entry_vectors) then goes to book-search, therapist-search, etc. So I don't want to search books/therapists from their query, but from the entries which result from their query. I also want to cluster the resulting entries (I still need to explore txtai's graph feature), for which I'd need the vectors. And finally (I'll submit a separate ticket), I'd like to search directly by vector, not text (using said mean).

davidmezzetti commented 1 year ago

Thank you for writing this issue up (and #396) with details. I really appreciate the context as that was my first question when I read the title.

Given you are only working on a list of search results, could you just take the text field of each returned element and call embeddings.batchtransform? I wouldn't expect that to be that intensive of an operation, assuming it's only something like 10 results.

This still requires the change in #396 to work. But with the trade-off of not encoding the query, perhaps it would (almost) even out. Looking at the code, embeddings.batchtransform could probably also be made more efficient, but I still don't think it would be slow for small result sets - probably a similar speed to something like faiss reconstruct().
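
Roughly, that approach could look like the sketch below (assuming batchtransform accepts the same (id, data, tags) tuples used for indexing; everything else is illustrative):

```python
import numpy as np

# Run the search with content enabled, returning the stored text
results = embeddings.search("select id, text from txtai where similar('query')", 10)

# Re-encode the result text; assumes batchtransform takes (id, data, tags) tuples
vectors = embeddings.batchtransform([(None, r["text"], None) for r in results])

# Mean vector to feed into the downstream resource searches
query_vector = np.mean(vectors, axis=0)
```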

lefnire commented 1 year ago

Alas, it'd be too heavy. Here's some deeper context (sorry, it's long). Users create long-form journal entries, about the length of a blog post / news article. On save, an entry is split into paragraphs, each paragraph is embedded, and entry.vector = np.mean(paragraph_vectors). This is because of the max-token limitations of embedders. So a single entry can be ~5 embed()s. A user might have thousands of entries, and a filter result would realistically show more like 100s. A common use-case is "show me my last year" (summarize, recommend resources, etc.). These filters could be applied rapid-fire, like a faceted search, so the pre-computed vectors are key. There's another complication: users belong to groups. They're matched to groups based on their entire np.mean(all_entries), which is recomputed after each new entry. So we wouldn't want group matching to re-compute every user's all-entries mean.

Sounds like it might be counter to txtai philosophy, so I'll look into subclassing / extending. No worries.

I might take a stab at replacing sqlite with PyArrow, which is pretty slick with its mmap / sharding / compression / etc. It still allows filtering on columns, as well as over partitions (e.g. S3 folder paths), and I could save the np.array() alongside. I'll post back if I go this direction.
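
If I do, a rough sketch of the storage side (illustrative only - the column names and Parquet choice are placeholders):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Store ids alongside the raw vectors in an Arrow table
ids = ["entry-1", "entry-2"]
vectors = np.random.rand(2, 384).astype("float32")

table = pa.table({
    "id": ids,
    "vector": [v.tolist() for v in vectors]  # list<float> column per row
})

pq.write_table(table, "vectors.parquet")

# Read back with a filter on the id column
loaded = pq.read_table("vectors.parquet", filters=[("id", "=", "entry-1")])
vector = np.array(loaded["vector"][0].as_py(), dtype="float32")
```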

davidmezzetti commented 1 year ago

Makes sense.

Given we're talking embeddings, I think looking at the ANN would be the best path. It's possible to create a custom ANN. This has the benefit of keeping the embeddings array in sync with update/deletes. This custom ANN instance could wrap another underlying ANN instance and also keep a local copy of the input embeddings, which could be stored as a NumPy array, Torch, arrow or the format of your choosing.

Rough code layout:

```python
from txtai.ann import ANN

class CustomANN(ANN):
    def __init__(self, config):
        super().__init__(config)

        # Underlying ANN instance being wrapped (creation omitted in this sketch)
        self.backingann = None

        # Local copy of the input embeddings array
        self.embeddings = None

    def index(self, embeddings):
        # Build the wrapped index and keep the raw vectors
        self.backingann.index(embeddings)
        self.embeddings = embeddings

    def lookup(self, indexid):
        # Return the original vector for an index id
        return self.embeddings[indexid]

    # Implement remaining ANN methods (append, delete, search, count, load, save)
```

Then, on the Embeddings instance, config can be defined to access this:

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "content": True,
  "backend": "CustomANN",
  "functions": [
    {"name": "vector", "function": "ann.lookup"}
  ]
})
```

This should then make it possible to run a SQL statement like this:

```sql
SELECT id, text, vector(indexid) FROM txtai WHERE similar('query')
```
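
For completeness, a hypothetical usage sketch, assuming the CustomANN backend and vector() function above are wired up:

```python
# Hypothetical usage: vector(indexid) resolves through the functions config above
results = embeddings.search(
    "SELECT id, text, vector(indexid) FROM txtai WHERE similar('query')", 10
)
```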

With the next major release (6.x), the plan is to make things like this easier. For example, multiple indexes for a single Embeddings instance.

lefnire commented 1 year ago

This is great, a wonderful jumping-off point - thanks for taking the time to write that up!

davidmezzetti commented 1 year ago

No problem, let me know how it goes. I will add #396 in for the next release.

davidmezzetti commented 1 year ago

One additional thing, since performance is the main factor here: it might be better to select the indexids from the query, then use those indexes to slice the embeddings array and do the mean all in one operation.
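
A rough sketch of that, assuming the CustomANN above keeps the array on an embeddings attribute and that indexid is selectable (names otherwise hypothetical):

```python
import numpy as np

# Pull just the index ids for the matching results
results = embeddings.search("select indexid from txtai where similar('query')", 100)
indexids = [r["indexid"] for r in results]

# Slice the stored embeddings once and take the mean in a single vectorized operation
# (embeddings.ann is the underlying ANN instance, here assumed to be the CustomANN)
mean_vector = np.mean(embeddings.ann.embeddings[indexids], axis=0)
```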

davidmezzetti commented 1 year ago

Closing this due to inactivity. Please re-open or open a new issue if this persists.