jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

Performance when committing or creating a new card is slow on very large card webs #694

Open jkomoros opened 1 month ago

jkomoros commented 1 month ago

It can take up to a few seconds to create a new card, or to commit a card that you've just edited.

Looking at a performance trace, the time is dominated by fingerprintForCard. Every time the set of cards changes we throw out and recompute fingerprints for every card. We should memoize it so the work is proportional to the number of changed cards, which should be very small.

In the constructor of FingerprintGenerator, pop out a separate fingerprintForCard, which is memoized and is passed the card to use (and make sure that it fetches the word counts for cards via the memoized wordCountsForCards).
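
A minimal sketch of that shape, assuming the cache can be keyed on the card object itself (editing a card produces a new object). The types, `Fingerprint` constructor, and `wordCountsForCards` signature here are placeholders, not the real card-web API:

```ts
//Hypothetical sketch only: types and helpers stand in for the real ones.
type Card = {id : string, body : string};
type IDFMap = Map<string, number>;
type WordCounts = Map<string, number>;

//Assumed to already exist and be memoized elsewhere in the pipeline.
declare function wordCountsForCards(cards : Card[]) : WordCounts[];

class Fingerprint {
	constructor(public wordCounts : WordCounts, public idfMap : IDFMap) {}
}

//Keyed on the card object: an edit replaces the object, so only edited
//cards miss the cache. It deliberately does NOT key on idfMap, so it can
//vend slightly stale fingerprints when the idfMap drifts.
const fingerprintCache = new WeakMap<Card, Fingerprint>();

export const fingerprintForCard = (card : Card, idfMap : IDFMap) : Fingerprint => {
	const cached = fingerprintCache.get(card);
	if (cached) return cached;
	const [counts] = wordCountsForCards([card]);
	const result = new Fingerprint(counts, idfMap);
	fingerprintCache.set(card, result);
	return result;
};
```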

Part of the reason this is slow is that technically every time a card changes, it changes the TF-IDF of all of the cards. However, past, say, 1000 cards, it's fine to assume the baseline idfMap is fixed and not recalculate it every time a card changes (or at least, to be OK with vending stale fingerprints that use a technically-out-of-date idfMap to set the priors for each word).
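
One way to read "assume the baseline idfMap is fixed" as code. The threshold constant, `computeIDFMap`, and the function name are made up, just to illustrate the tradeoff:

```ts
//Hypothetical: above some card count, keep vending the cached (possibly
//slightly stale) idfMap instead of recomputing it on every card change.
type Card = {id : string};
type IDFMap = Map<string, number>;

declare function computeIDFMap(cards : Card[]) : IDFMap;

const CARD_COUNT_TO_FREEZE_IDF = 1000;

let cachedIDFMap : IDFMap | null = null;

export const idfMapForCards = (cards : Card[]) : IDFMap => {
	if (cachedIDFMap && cards.length > CARD_COUNT_TO_FREEZE_IDF) {
		//Large web: accept a technically-out-of-date idfMap.
		return cachedIDFMap;
	}
	cachedIDFMap = computeIDFMap(cards);
	return cachedIDFMap;
};
```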

jkomoros commented 1 week ago

Simply make the word clouds in the sidebar closed by default.

You can open them and they stay open, but every time you edit they close again by default.

Alternate approach: maybe the pipeline can do what we do for the live word cloud while editing: do a fingerprint on top of a snapshot of the base.

Do something like cardsForIDF, which uses a snapshotting mechanism similar to the other one I used, and only re-updates when a card update that changes more than 5% of cards lands.
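
A sketch of what that snapshotting mechanic might look like (cardsForIDF is the name floated above; everything else is hypothetical). The key property is that it returns the exact same object until the change threshold is crossed, so downstream memoization keyed on identity keeps hitting:

```ts
//Hypothetical snapshotting selector: return the same snapshot of cards
//until more than 5% of them have changed since the snapshot was taken.
type Card = {id : string};
type Cards = {[id : string] : Card};

const IDF_REFRESH_FRACTION = 0.05;

let snapshot : Cards | null = null;

const countChangedCards = (before : Cards, after : Cards) : number => {
	const ids = new Set([...Object.keys(before), ...Object.keys(after)]);
	let changed = 0;
	for (const id of ids) {
		if (before[id] !== after[id]) changed++;
	}
	return changed;
};

export const cardsForIDF = (cards : Cards) : Cards => {
	if (!snapshot) {
		snapshot = cards;
		return snapshot;
	}
	const changed = countChangedCards(snapshot, cards);
	if (changed > Object.keys(snapshot).length * IDF_REFRESH_FRACTION) {
		snapshot = cards;
	}
	return snapshot;
};
```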

Although why is it so slow? The IDF pipeline is heavily cached with word counts, so the work should be something like "fetch almost entirely pre-cached objects and then sum up all word counts and divide"...
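
For reference, the cheap path being described would look roughly like the following. This uses a common IDF formulation (log of corpus size over document frequency); the real formula and caching layers in card-web may differ:

```ts
//Rough shape of "sum up pre-cached word counts and divide": build document
//frequencies from already-cached per-card word counts, then one log per word.
type WordCounts = Map<string, number>;
type IDFMap = Map<string, number>;

//Assumed to be cached per card elsewhere in the pipeline.
declare function cachedWordCountsForCard(cardID : string) : WordCounts;

export const idfMapForCardIDs = (cardIDs : string[]) : IDFMap => {
	const docFrequency = new Map<string, number>();
	for (const id of cardIDs) {
		for (const word of cachedWordCountsForCard(id).keys()) {
			docFrequency.set(word, (docFrequency.get(word) || 0) + 1);
		}
	}
	const idf : IDFMap = new Map();
	for (const [word, df] of docFrequency) {
		idf.set(word, Math.log(cardIDs.length / (1 + df)));
	}
	return idf;
};
```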

The reasons it's slow right now are 1) generating the recomputed fingerprint, which includes processing all of the other cards, and 2) when you save a working-notes card, the card finisher recomputes the entire fingerprint generator in a blocking way to generate the title.

Also, the "similar cards" pipeline requires computing fingerprints for every card to compare against, and presumably might be hit before the embedding-based similarity shows up

jkomoros commented 1 week ago

An implementation (which is currently SLOWER) is in fingerprint-performance.

Random musing notes to self:

generateFingerprint is memoized on its first argument and takes a card, idfMap, fingerprint size, and field list. It's a thin wrapper around new Fingerprint.
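
A sketch of that wrapper, assuming a small memoize-on-first-argument helper (all names and the Fingerprint constructor signature here are hypothetical):

```ts
//Hypothetical memoize-on-first-argument helper plus the thin wrapper.
type Card = {id : string};
type IDFMap = Map<string, number>;

class Fingerprint {
	constructor(
		public card : Card,
		public idfMap : IDFMap,
		public size : number,
		public fields : string[]
	) {}
}

const memoizeFirstArg = <First extends object, Rest extends unknown[], Result>(
	fn : (first : First, ...rest : Rest) => Result
) => {
	const cache = new WeakMap<First, Result>();
	return (first : First, ...rest : Rest) : Result => {
		if (cache.has(first)) return cache.get(first) as Result;
		const result = fn(first, ...rest);
		cache.set(first, result);
		return result;
	};
};

//Memoized on the card object only: the same card object returns the same
//Fingerprint even if the other arguments drift, which matches the
//"stale is OK" tradeoff described earlier.
export const generateFingerprint = memoizeFirstArg(
	(card : Card, idfMap : IDFMap, size : number, fields : string[]) =>
		new Fingerprint(card, idfMap, size, fields)
);
```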

FingerprintGenerator stays mostly the same, except it gets an idfMap from the cards, and that is memoized to return the same object unless more than 5% of cards have been added since last time.

Make sure the part in the card finisher also uses the same caching pipeline. Note that its fingerprint generator will have a different field list, which will throw out the other fingerprints and require recalculation (since we currently calculate every one when the fingerprint generator is created), even though only one will actually be necessary and different (and the idfMap will be the same).
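
A sketch of how these pieces might hang together, with FingerprintGenerator computing fingerprints lazily, so a generator built with a different field list (like the card finisher's) is cheap to construct and only does work for the cards it's actually asked about. All names and signatures here are assumptions, not the current API:

```ts
//Hypothetical wiring: the generator takes the memoized idfMap and defers
//per-card work to generateFingerprint instead of precomputing everything
//in the constructor.
type Card = {id : string};
type IDFMap = Map<string, number>;
type Fingerprint = Map<string, number>;

//Assumed to exist per the notes above: stable unless >5% of cards changed.
declare function memoizedIDFMapForCards(cards : Card[]) : IDFMap;
//Assumed to be a (possibly per-generator) memoized wrapper around the
//Fingerprint constructor, as sketched earlier.
declare function generateFingerprint(card : Card, idfMap : IDFMap, size : number, fields : string[]) : Fingerprint;

export class FingerprintGenerator {
	private _idfMap : IDFMap;

	constructor(cards : Card[], private _size : number, private _fields : string[]) {
		//No per-card work here; the current constructor computes every fingerprint.
		this._idfMap = memoizedIDFMapForCards(cards);
	}

	//Only the cards actually asked about get a fingerprint computed.
	fingerprintForCard(card : Card) : Fingerprint {
		return generateFingerprint(card, this._idfMap, this._size, this._fields);
	}
}
```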