jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0
46 stars 8 forks

Use OpenAI embeddings for similarity #646

Open jkomoros opened 1 year ago

jkomoros commented 1 year ago

Use https://beta.openai.com/docs/guides/embeddings/use-cases to calculate similarity.

If an openai_secret_key is provided in config.SECRET.json then it activates embedding-based similarity.

A new cloud function is set up so that when a card's title or body is modified, the card is flagged to have its embedding fetched. (The fetching has to happen server-side to protect the secret key.) By driving it off of cards being edited, we can draft off of the firestore permissions to make it hard to abuse our embedding secret key. Getting an embedding for a live-editing card is harder, though.

Fetching embeddings could take a while and could fail, and sometimes we'd need to do many in bulk, so we'll need some kind of implicit queuing system, flagging cards that have an embedding fetch in flight or that need a new one.
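As a minimal sketch of that flagging approach (the field names `embeddingDirty` and `embeddingInFlight` are hypothetical, not existing card fields):

```typescript
// Sketch of the embedding-fetch flags a card document might carry.
// Field names are hypothetical placeholders for this design discussion.
type CardEmbeddingState = {
  embeddingDirty: boolean;    // content changed since the last embedding
  embeddingInFlight: boolean; // a fetch is currently running
};

// Decide whether the queue should kick off an embedding fetch for a card.
function needsEmbeddingFetch(state: CardEmbeddingState): boolean {
  return state.embeddingDirty && !state.embeddingInFlight;
}
```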

The functions that compute the similarity between two cards instead use the embeddings. (Maybe have a different type of fingerprint, an EmbeddingFingerprint?)
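Embedding-based similarity between two cards would presumably be cosine similarity over their vectors, per the OpenAI guide linked above; a minimal sketch:

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector length mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```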

jkomoros commented 1 year ago

See also https://github.com/dglazkov/wanderer

jkomoros commented 1 year ago

And https://github.com/openai/openai-cookbook/blob/838f000935d9df03e75e181cbcea2e306850794b/examples/Question_answering_using_embeddings.ipynb

jkomoros commented 1 year ago

Create a native polymath endpoint instead of needing to do an export of a JSON file and re-save it.

jkomoros commented 1 year ago

https://extensions.dev/extensions/googlecloud/firestore-semantic-search

jkomoros commented 1 year ago

The content that is sent to the embedder should also include the canonical forms of any concept links, to help create semantic connection across multiple synonyms

jkomoros commented 9 months ago

A few challenges: the set of embeddings in production will likely be extremely large, and push the renderer closer to an OOM. There will also be many cases where you don't have the embeddings but still want to do something meaningful.

Have a new set of query filters, meaning, which, if embeddings are available, sorts based on cosine similarity, and if they aren't, falls back to just being an alias for similarity.

One design: keep the embeddings in a cloud function. Use hnsw for the index. Store the index in Cloud Storage, and every time you save a new snapshot, remove old copies beyond some count (keep a few just in case). Every time the cloud function loads (will Cloud Functions v2 help the instance be reused more often?) it loads the most recent snapshot.

We can use Object Versioning in Cloud Storage, and ifGenerationMatch to check before writing that no edits have been made. If they have, reload the most recent snapshot and try again (up to, say, 3 times). Once the write succeeds, also write the information to the embeddings firestore collection (see below).

There should be a clean operation that looks for ids in the hnsw index that don't have a corresponding firestore entry and deletes them (otherwise there will be items that continually show up in queries and have to be filtered out).

hnsw doesn't allow saving metadata, so we'll have to do that some other way, including maintaining a mapping from cardID -> hnsw index. That will be stored in a new embeddings firestore collection, keyed off of cardID + embedding_space (allowing new spaces to be added in the future), like c-123-4567+embedding-ada-002. Each record will have the embedding_index, the last_updated date, a version number for the card-extraction version, and a snapshot of the embedded text.

Every time a card is saved and its content changes, we check whether there is an embedding record, and if there is, whether the text is equivalent. If either isn't true, we kick off an embedding request, store the result in the hnsw index, save a snapshot, and update the embedding record.

The card-extraction version allows us to experiment with new formats for the text to embed, including just cardPlainText (note: for content cards, this will need to include the title), but also things like including the canonical form of the concept links, as well as a date (which will inherently get a bit of nearby-date similarity overlap?). There also needs to be an operation to kick off creating new embedding entries when a new extraction version is pushed, in addition to the incremental onCardUpdated hook that computes incremental embeddings. Make sure the embeddings collection is not allowed to be downloaded to the client (especially if it contains the full embedded text).
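The embeddings-collection record described above might be typed roughly like this (a sketch of the schema, not implemented code):

```typescript
type CardID = string;

// Document key: `${cardID}+${embeddingSpace}`,
// e.g. 'c-123-4567+embedding-ada-002'.
function embeddingRecordKey(cardID: CardID, embeddingSpace: string): string {
  return `${cardID}+${embeddingSpace}`;
}

// One record per card per embedding space.
type EmbeddingRecord = {
  embedding_index: number;    // position of this card in the hnsw index
  last_updated: number;       // ms since epoch
  extraction_version: number; // version of the text-extraction format
  embedded_text: string;      // snapshot of the exact text that was embedded
};
```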

There is also an endpoint, which anyone can hit (because it never sends back card content), that takes a key card ID and a k and returns an array of [CARD_ID, similarity] records of the most similar items. (You can pass -1 for k to mean 'literally every card'.) When you hit the endpoint, it loads up the hnsw index, looks up the embedding_index of the given card ID, fetches the embedding of that item, fetches the k most similar, and then reverses out the card_id of each one before returning. You can also pass the endpoint card content instead of a cardID, useful for computing similarity for a card as it is being edited. In the future we can filter out any cards the given user doesn't have access to (to not leak the existence of other cards, and to ensure the list of records that is passed back isn't, for example, entirely unpublished cards they can't see).

The local filter for meaning will keep track of cached similarity lists for key cards (invalidating the list each time a card is edited). The filter will have a bit of a delay before it gives a result, as it fetches from the endpoint.

The content to be embedded is produced by a function that takes a card and a collection of concept cards and produces a canonical text, which includes the canonical form of every linked concept card's title appended at the end.
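A sketch of that extraction function, assuming hypothetical accessors for a card's plain text and concept references (the real card shape in the repo differs):

```typescript
// Minimal card shape for extraction purposes; field names are hypothetical.
type ExtractableCard = {
  id: string;
  title: string;
  plainText: string;
  conceptRefs: string[]; // ids of concept cards this card links to
};

// Produce the canonical text to embed: title + body, with the canonical
// title of every referenced concept card appended at the end. Unknown
// concept references are silently skipped.
function textToEmbed(
  card: ExtractableCard,
  conceptCards: Map<string, ExtractableCard>
): string {
  const conceptTitles = card.conceptRefs
    .map((id) => conceptCards.get(id)?.title)
    .filter((t): t is string => t !== undefined);
  return [card.title, card.plainText, ...conceptTitles].join('\n');
}
```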

jkomoros commented 8 months ago

Just use Qdrant? (an open-source Pinecone alternative)

Yeah, just use Qdrant; there are a ton of database-administration tasks that would be too annoying to do by hand. Also, if we ran the DB inside a cloud function, duplicating the service a million times (with more resource use and possible collisions) is likely.

Add a qdrant_api_key and qdrant_url to config.SECRET.json. Document how to set them and what they do. (Warn at gulp file generation if the qdrant key is set but the openai key is not.) Also have .GENERATED. include a VECTOR_STORE_ENABLED flag.
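For reference, a sketch of what the new config.SECRET.json keys might look like (values are placeholders; exact key names should match whatever the gulp config loader expects):

```json
{
  "openai_secret_key": "sk-...",
  "qdrant_api_key": "YOUR_QDRANT_API_KEY",
  "qdrant_url": "https://YOUR-CLUSTER.cloud.qdrant.io"
}
```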

Add a client tool that checks whether the qdrant_api_key is set, and if so, during deploy checks whether the DB is configured (via collection_info) and, if not, configures it. Configuration creates the named collection (openai.com:text-embedding-ada-002, with dev- prepended in dev_mode) and then adds two indexes, on card_id and version.

The ID for each point is card_id+version (verify qdrant doesn't literally require a UUID). This allows us to avoid keeping track of an integer index and which one to use for the next insert. The payload in the qdrant store is structured like this:

{
  //Indexed
  "card_id": CardID,
  //The version of the content extraction, allowing adding a new one later
  //Indexed
  "version": 0,
  "content": "<Extracted content>",
  "last_updated": timestamp
}
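As a sketch of the key builder and a typed version of that payload (note: Qdrant's documented point IDs are unsigned integers or UUIDs, so a composite string key like this would need to be deterministically mapped to a UUID, e.g. via UUIDv5):

```typescript
type CardID = string;

// Composite key for a point: card_id plus content-extraction version.
// Qdrant point IDs must be an unsigned integer or a UUID, so this string
// would need to be hashed into a UUID before being used as a point ID.
function pointKey(cardID: CardID, version: number): string {
  return `${cardID}+${version}`;
}

// Payload stored alongside each vector, mirroring the JSON structure above.
type PointPayload = {
  card_id: CardID;      // indexed
  version: number;      // indexed; content-extraction version
  content: string;      // the extracted text that was embedded
  last_updated: number; // timestamp, ms since epoch
};
```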

functions/src/embeddings.ts creates a qdrant client, if the api_key is configured.

There are three endpoints:

1) Re-index any missing cards. Fetches all cards, then goes through one by one calling updateCard. An HTTPS trigger.

2) updateCardEmbedding, which just calls updateCard. A firestore trigger. Extracts the text content (bailing early if there is none). Then does a getPoint with the computed ID (or scroll with a card_id + version filter if the ID is a UUID) to fetch the payload and compare the text content. If the text is the same, quit. If not, compute the embedding and upsert.

3) The query endpoint. It takes either a card_id or a card to extract from, computes the embedding (or, for a card_id, tries to fetch it via getPoint with with_vector), and then does the search, passing a filter of version=${currentVersion}. Then it extracts the card_id and score of each result and passes those back as a tuple. Note that if the IDs are not UUIDs, it doesn't need to fetch the payload or vector, which allows it to request, say, the top 200 similar items reasonably quickly (how high can it go without getting too slow?).
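The "compare and bail early" step of updateCard could be sketched as follows (hypothetical names; the real version would call Qdrant's getPoint and the OpenAI embeddings API around this decision):

```typescript
// What the point store already holds for this card, if anything.
type ExistingPayload = { content: string } | undefined;

type UpdateAction = 'skip' | 'embed-and-upsert';

// Decide what updateCard should do, given the freshly extracted content
// and the payload (if any) already stored for this card + version.
function decideUpdate(extracted: string, existing: ExistingPayload): UpdateAction {
  if (!extracted) return 'skip'; // no content: bail early
  if (existing && existing.content === extracted) return 'skip'; // unchanged
  return 'embed-and-upsert'; // new or changed content
}
```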

There is also a check during deploy asking the user whether they want to trigger content reindexing (defaulting to false) (or maybe just run it every time, as long as the qdrant key is configured?).

jkomoros commented 8 months ago