jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0
46 stars 8 forks

Use OpenAI embeddings for similarity #646

Open jkomoros opened 1 year ago

jkomoros commented 1 year ago

Use https://beta.openai.com/docs/guides/embeddings/use-cases to calculate similarity.

If an openai_secret_key is provided in config.SECRET.json then it activates embedding-based similarity.

A new cloud function is set up so that when a card's title or body is modified, the card is flagged to have its embedding fetched. (The fetching has to happen server-side to protect the secret key.) By driving it off of cards being edited, we can draft off of the firestore permissions to make it hard to abuse our embedding secret key. Getting an embedding for a live-editing card is harder, though.

Fetching embeddings could take a while and could fail, and sometimes we'd need to do many in bulk, so we'll need some kind of implicit queuing system, flagging cards that have an embedding fetch in flight or that need a new one.
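As a minimal sketch of that flagging approach (the field names `embeddingDirty` and `embeddingInFlight` are hypothetical, not existing card fields):

```typescript
// Sketch of the embedding-fetch flags a card document might carry.
// Field names are hypothetical placeholders for this design discussion.
type CardEmbeddingState = {
  embeddingDirty: boolean;    // content changed since the last embedding
  embeddingInFlight: boolean; // a fetch is currently running
};

// Decide whether the queue should kick off an embedding fetch for a card.
function needsEmbeddingFetch(state: CardEmbeddingState): boolean {
  return state.embeddingDirty && !state.embeddingInFlight;
}
```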

The functions that compute the similarity between two cards instead use the embeddings. (Maybe have a different type of fingerprint, an EmbeddingFingerprint?)
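Embedding-based similarity between two cards would presumably be cosine similarity over their vectors, per the OpenAI guide linked above; a minimal sketch:

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector length mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```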

jkomoros commented 1 year ago

See also https://github.com/dglazkov/wanderer

jkomoros commented 1 year ago

And https://github.com/openai/openai-cookbook/blob/838f000935d9df03e75e181cbcea2e306850794b/examples/Question_answering_using_embeddings.ipynb

jkomoros commented 1 year ago

Create a native polymath endpoint instead of needing to do an export of a JSON file and re-save it.

jkomoros commented 1 year ago

https://extensions.dev/extensions/googlecloud/firestore-semantic-search

jkomoros commented 1 year ago

The content that is sent to the embedder should also include the canonical forms of any concept links, to help create semantic connection across multiple synonyms

jkomoros commented 9 months ago

A few challenges: the set of embeddings in production will likely be extremely large, and push the renderer closer to an OOM. There will also be many cases where you don't have the embeddings but still want to do something meaningful.

Have a new set of query filters, meaning, which, if embeddings are available, sorts based on cosine similarity, and if they aren't, falls back to just being an alias for similarity.

One design: keep the embeddings in a cloud function. Use hnsw for the index. Store the index in Cloud Storage, and every time you save a new snapshot, remove old copies beyond some count (keep a few just in case). Every time the cloud function loads (will Cloud Functions v2 help the instance be reused more often?) it loads the most recent snapshot.

We can use Object Versioning in Cloud Storage, and ifGenerationMatch to check before writing that no edits have been made. If they have, reload the most recent snapshot and try again (up to, say, 3 times). Once the write succeeds, also write the information to the embeddings firestore collection (see below).

There should be a clean operation that looks for ids in the hnsw index that don't have a corresponding firestore entry and deletes them (otherwise there will be items that continually show up in queries and have to be filtered out).

hnsw doesn't allow saving metadata, so we'll have to do that some other way, including maintaining a mapping from cardID -> hnsw index. That will be stored in a new embeddings firestore collection, keyed off of cardID + embedding_space (allowing new spaces to be added in the future), like c-123-4567+embedding-ada-002. Each record will have the embedding_index, the last_updated date, a version number for the card-extraction version, and a snapshot of the embedded text.

Every time a card is saved and its content changes, we check whether there is an embedding record, and if there is, whether the text is equivalent. If either isn't true, we kick off an embedding request, store the result in the hnsw index, save a snapshot, and update the embedding record.

The card-extraction version allows us to experiment with new formats for the text to embed, including just cardPlainText (note: for content cards, this will need to include the title), but also things like including the canonical form of the concept links, as well as a date (which will inherently get a bit of nearby-date similarity overlap?). There also needs to be an operation to kick off creating new embedding entries when a new extraction version is pushed, in addition to the incremental onCardUpdated hook that computes incremental embeddings. Make sure the embeddings collection is not allowed to be downloaded to the client (especially if it contains the full embedded text).
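The embeddings-collection record described above might be typed roughly like this (a sketch of the schema, not implemented code):

```typescript
type CardID = string;

// Document key: `${cardID}+${embeddingSpace}`,
// e.g. 'c-123-4567+embedding-ada-002'.
function embeddingRecordKey(cardID: CardID, embeddingSpace: string): string {
  return `${cardID}+${embeddingSpace}`;
}

// One record per card per embedding space.
type EmbeddingRecord = {
  embedding_index: number;    // position of this card in the hnsw index
  last_updated: number;       // ms since epoch
  extraction_version: number; // version of the text-extraction format
  embedded_text: string;      // snapshot of the exact text that was embedded
};
```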

There is also an endpoint, which anyone can hit (because it never sends back card content), that takes a key card ID and a k and returns an array of [CARD_ID, similarity] records of the most similar items. (You can pass -1 for k to mean 'literally every card'.) When you hit the endpoint, it loads up the hnsw index, looks up the embedding_index of the given card ID, fetches the embedding of that item, fetches the k most similar, and then reverses out the card_id of each one before returning. You can also pass the endpoint card content instead of a cardID, useful for computing similarity for a card as it is being edited. In the future we can filter out any cards the given user doesn't have access to (to not leak the existence of other cards, and to ensure the list of records that is passed back isn't, for example, entirely unpublished cards they can't see).

The local filter for meaning will keep track of cached similarity lists for key cards (invalidating the list each time a card is edited). The filter will have a bit of a delay before it gives a result, as it fetches from the endpoint.

The content to be embedded is produced by a function that takes a card and a collection of concept cards and produces a canonical text, which includes the canonical form of every linked concept card's title appended at the end.
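A sketch of that extraction function, assuming hypothetical accessors for a card's plain text and concept references (the real card shape in the repo differs):

```typescript
// Minimal card shape for extraction purposes; field names are hypothetical.
type ExtractableCard = {
  id: string;
  title: string;
  plainText: string;
  conceptRefs: string[]; // ids of concept cards this card links to
};

// Produce the canonical text to embed: title + body, with the canonical
// title of every referenced concept card appended at the end. Unknown
// concept references are silently skipped.
function textToEmbed(
  card: ExtractableCard,
  conceptCards: Map<string, ExtractableCard>
): string {
  const conceptTitles = card.conceptRefs
    .map((id) => conceptCards.get(id)?.title)
    .filter((t): t is string => t !== undefined);
  return [card.title, card.plainText, ...conceptTitles].join('\n');
}
```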

jkomoros commented 8 months ago

Just use Qdrant? (an open-source Pinecone alternative)

Yeah, just use Qdrant; there are a ton of database-administration tasks that would be too annoying to do by hand. Also, if we ran the DB inside a cloud function, duplicating the service a million times (with more resource use and possible collisions) is likely.

Add a qdrant_api_key and qdrant_url to config.SECRET.json. Document how to set them and what they do. (Warn at gulp file generation if the qdrant key is set but the openai key is not.) Also have .GENERATED. include a VECTOR_STORE_ENABLED flag.
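For reference, a sketch of what the new config.SECRET.json keys might look like (values are placeholders; exact key names should match whatever the gulp config loader expects):

```json
{
  "openai_secret_key": "sk-...",
  "qdrant_api_key": "YOUR_QDRANT_API_KEY",
  "qdrant_url": "https://YOUR-CLUSTER.cloud.qdrant.io"
}
```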

Add a client tool that checks whether the qdrant_api_key is set, and if so, during deploy checks whether the DB is configured (via collection_info) and, if not, configures it. Configuration creates the named collection (openai.com:text-embedding-ada-002, with dev- prepended in dev_mode) and then adds two indexes, on card_id and version.

The ID for each point is card_id+version (verify qdrant doesn't literally require a UUID). This allows us to avoid keeping track of an integer index and which one to use for the next insert. The payload in the qdrant store is structured like this:

{
  //Indexed
  "card_id": CardID,
  //The version of the content extraction, allowing adding a new one later
  //Indexed
  "version": 0,
  "content": "<Extracted content>",
  "last_updated": timestamp
}
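As a sketch of the key builder and a typed version of that payload (note: Qdrant's documented point IDs are unsigned integers or UUIDs, so a composite string key like this would need to be deterministically mapped to a UUID, e.g. via UUIDv5):

```typescript
type CardID = string;

// Composite key for a point: card_id plus content-extraction version.
// Qdrant point IDs must be an unsigned integer or a UUID, so this string
// would need to be hashed into a UUID before being used as a point ID.
function pointKey(cardID: CardID, version: number): string {
  return `${cardID}+${version}`;
}

// Payload stored alongside each vector, mirroring the JSON structure above.
type PointPayload = {
  card_id: CardID;      // indexed
  version: number;      // indexed; content-extraction version
  content: string;      // the extracted text that was embedded
  last_updated: number; // timestamp, ms since epoch
};
```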

functions/src/embeddings.ts creates a qdrant client, if the api_key is configured.

There are three endpoints:

1) Re-index any missing cards. Fetches all cards, then goes through one by one calling updateCard. An HTTPS trigger.

2) updateCardEmbedding, which just calls updateCard. A firestore trigger. Extracts the text content (bailing early if there is none). Then does a getPoint with the computed ID (or scroll with a card_id + version filter if the ID is a UUID) to fetch the payload and compare the text content. If the text is the same, quit. If not, compute the embedding and upsert.

3) The query endpoint. It takes either a card_id or a card to extract from, computes the embedding (or, for a card_id, tries to fetch it via getPoint with with_vector), and then does the search, passing a filter of version=${currentVersion}. Then it extracts the card_id and score of each result and passes those back as a tuple. Note that if the IDs are not UUIDs, it doesn't need to fetch the payload or vector, which allows it to request, say, the top 200 similar items reasonably quickly (how high can it go without getting too slow?).
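The "compare and bail early" step of updateCard could be sketched as follows (hypothetical names; the real version would call Qdrant's getPoint and the OpenAI embeddings API around this decision):

```typescript
// What the point store already holds for this card, if anything.
type ExistingPayload = { content: string } | undefined;

type UpdateAction = 'skip' | 'embed-and-upsert';

// Decide what updateCard should do, given the freshly extracted content
// and the payload (if any) already stored for this card + version.
function decideUpdate(extracted: string, existing: ExistingPayload): UpdateAction {
  if (!extracted) return 'skip'; // no content: bail early
  if (existing && existing.content === extracted) return 'skip'; // unchanged
  return 'embed-and-upsert'; // new or changed content
}
```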

There is also a check during deploy asking the user whether they want to trigger content reindexing (defaulting to false) (or maybe just run it every time, as long as the qdrant key is configured?).

jkomoros commented 8 months ago