jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

Performance when committing or creating a new card is slow on very large card webs #694

Open jkomoros opened 1 month ago

jkomoros commented 1 month ago

It can take up to a few seconds to create a new card, or to commit a card that you've just edited.

Looking at a performance trace, the time is dominated by fingerprintForCard. Every time the set of cards changes we throw out and recompute fingerprints for every card. We should memoize it so the work is proportional to the number of changed cards, which should be very small.

In the constructor of FingerprintGenerator, pop out a separate fingerprintForCard, which is memoized and is passed the card to use (and make sure that it fetches the word counts for cards via the memoized wordCountsForCards).
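
A minimal sketch of that shape, assuming the cache can be keyed on the card object itself (editing a card produces a new object). The types, `Fingerprint` constructor, and `wordCountsForCards` signature here are placeholders, not the real card-web API:

```ts
//Hypothetical sketch only: types and helpers stand in for the real ones.
type Card = {id : string, body : string};
type IDFMap = Map<string, number>;
type WordCounts = Map<string, number>;

//Assumed to already exist and be memoized elsewhere in the pipeline.
declare function wordCountsForCards(cards : Card[]) : WordCounts[];

class Fingerprint {
	constructor(public wordCounts : WordCounts, public idfMap : IDFMap) {}
}

//Keyed on the card object: an edit replaces the object, so only edited
//cards miss the cache. It deliberately does NOT key on idfMap, so it can
//vend slightly stale fingerprints when the idfMap drifts.
const fingerprintCache = new WeakMap<Card, Fingerprint>();

export const fingerprintForCard = (card : Card, idfMap : IDFMap) : Fingerprint => {
	const cached = fingerprintCache.get(card);
	if (cached) return cached;
	const [counts] = wordCountsForCards([card]);
	const result = new Fingerprint(counts, idfMap);
	fingerprintCache.set(card, result);
	return result;
};
```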

Part of the reason this is slow is that technically every time a card changes, it changes the TF-IDF of all of the cards. However, past, say, 1000 cards, it's fine to assume the baseline idfMap is fixed and not recalculate it every time a card changes (or at least, to be OK with vending stale fingerprints that use a technically-out-of-date idfMap to set the priors for each word).
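
One way to read "assume the baseline idfMap is fixed" as code. The threshold constant, `computeIDFMap`, and the function name are made up, just to illustrate the tradeoff:

```ts
//Hypothetical: above some card count, keep vending the cached (possibly
//slightly stale) idfMap instead of recomputing it on every card change.
type Card = {id : string};
type IDFMap = Map<string, number>;

declare function computeIDFMap(cards : Card[]) : IDFMap;

const CARD_COUNT_TO_FREEZE_IDF = 1000;

let cachedIDFMap : IDFMap | null = null;

export const idfMapForCards = (cards : Card[]) : IDFMap => {
	if (cachedIDFMap && cards.length > CARD_COUNT_TO_FREEZE_IDF) {
		//Large web: accept a technically-out-of-date idfMap.
		return cachedIDFMap;
	}
	cachedIDFMap = computeIDFMap(cards);
	return cachedIDFMap;
};
```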

jkomoros commented 1 week ago

Simply make the word clouds in the sidebar closed by default.

You can open them and they stay open, but every time you edit they close again by default.

Alternate approach: maybe the pipeline can do what we do for the live word cloud while editing: do a fingerprint on top of a snapshot of the base.

Do something like cardsForIDF, which uses a snapshotting mechanism similar to the other one I used, and only re-updates when a card update that changes more than 5% of cards lands.
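
A sketch of what that snapshotting mechanic might look like (cardsForIDF is the name floated above; everything else is hypothetical). The key property is that it returns the exact same object until the change threshold is crossed, so downstream memoization keyed on identity keeps hitting:

```ts
//Hypothetical snapshotting selector: return the same snapshot of cards
//until more than 5% of them have changed since the snapshot was taken.
type Card = {id : string};
type Cards = {[id : string] : Card};

const IDF_REFRESH_FRACTION = 0.05;

let snapshot : Cards | null = null;

const countChangedCards = (before : Cards, after : Cards) : number => {
	const ids = new Set([...Object.keys(before), ...Object.keys(after)]);
	let changed = 0;
	for (const id of ids) {
		if (before[id] !== after[id]) changed++;
	}
	return changed;
};

export const cardsForIDF = (cards : Cards) : Cards => {
	if (!snapshot) {
		snapshot = cards;
		return snapshot;
	}
	const changed = countChangedCards(snapshot, cards);
	if (changed > Object.keys(snapshot).length * IDF_REFRESH_FRACTION) {
		snapshot = cards;
	}
	return snapshot;
};
```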

Although why is it so slow? The IDF pipeline is heavily cached with word counts, so the work should be something like "fetch almost entirely pre-cached objects and then sum up all word counts and divide"...
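
For reference, the cheap path being described would look roughly like the following. This uses a common IDF formulation (log of corpus size over document frequency); the real formula and caching layers in card-web may differ:

```ts
//Rough shape of "sum up pre-cached word counts and divide": build document
//frequencies from already-cached per-card word counts, then one log per word.
type WordCounts = Map<string, number>;
type IDFMap = Map<string, number>;

//Assumed to be cached per card elsewhere in the pipeline.
declare function cachedWordCountsForCard(cardID : string) : WordCounts;

export const idfMapForCardIDs = (cardIDs : string[]) : IDFMap => {
	const docFrequency = new Map<string, number>();
	for (const id of cardIDs) {
		for (const word of cachedWordCountsForCard(id).keys()) {
			docFrequency.set(word, (docFrequency.get(word) || 0) + 1);
		}
	}
	const idf : IDFMap = new Map();
	for (const [word, df] of docFrequency) {
		idf.set(word, Math.log(cardIDs.length / (1 + df)));
	}
	return idf;
};
```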

The reasons it's slow right now are 1) generating the recomputed fingerprint, which includes processing all of the other cards, and 2) when you save a working-notes card, the card finisher recomputes the entire fingerprint generator in a blocking way to generate the title.

Also, the "similar cards" pipeline requires computing fingerprints for every card to compare against, and presumably might be hit before the embedding-based similarity shows up

jkomoros commented 1 week ago

An implementation (which is currently SLOWER) is in fingerprint-performance.

Random musing notes to self:

generateFingerprint is memoized on its first argument and takes a card, idfMap, fingerprint size, and field list. It's a thin wrapper around new Fingerprint.
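
A sketch of that wrapper, assuming a small memoize-on-first-argument helper (all names and the Fingerprint constructor signature here are hypothetical):

```ts
//Hypothetical memoize-on-first-argument helper plus the thin wrapper.
type Card = {id : string};
type IDFMap = Map<string, number>;

class Fingerprint {
	constructor(
		public card : Card,
		public idfMap : IDFMap,
		public size : number,
		public fields : string[]
	) {}
}

const memoizeFirstArg = <First extends object, Rest extends unknown[], Result>(
	fn : (first : First, ...rest : Rest) => Result
) => {
	const cache = new WeakMap<First, Result>();
	return (first : First, ...rest : Rest) : Result => {
		if (cache.has(first)) return cache.get(first) as Result;
		const result = fn(first, ...rest);
		cache.set(first, result);
		return result;
	};
};

//Memoized on the card object only: the same card object returns the same
//Fingerprint even if the other arguments drift, which matches the
//"stale is OK" tradeoff described earlier.
export const generateFingerprint = memoizeFirstArg(
	(card : Card, idfMap : IDFMap, size : number, fields : string[]) =>
		new Fingerprint(card, idfMap, size, fields)
);
```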

FingerprintGenerator stays mostly the same, except it gets an idfMap from the cards, and that is memoized to return the same object unless more than 5% of cards have been added since last time.

Make sure the part in the card finisher also uses the same caching pipeline. Note that its fingerprint generator will have a different field list, which will throw out the other fingerprints and require recalculation (since we currently calculate every one when the fingerprint generator is created), even though only one will actually be necessary and different (and the idfMap will be the same).
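
A sketch of how these pieces might hang together, with FingerprintGenerator computing fingerprints lazily, so a generator built with a different field list (like the card finisher's) is cheap to construct and only does work for the cards it's actually asked about. All names and signatures here are assumptions, not the current API:

```ts
//Hypothetical wiring: the generator takes the memoized idfMap and defers
//per-card work to generateFingerprint instead of precomputing everything
//in the constructor.
type Card = {id : string};
type IDFMap = Map<string, number>;
type Fingerprint = Map<string, number>;

//Assumed to exist per the notes above: stable unless >5% of cards changed.
declare function memoizedIDFMapForCards(cards : Card[]) : IDFMap;
//Assumed to be a (possibly per-generator) memoized wrapper around the
//Fingerprint constructor, as sketched earlier.
declare function generateFingerprint(card : Card, idfMap : IDFMap, size : number, fields : string[]) : Fingerprint;

export class FingerprintGenerator {
	private _idfMap : IDFMap;

	constructor(cards : Card[], private _size : number, private _fields : string[]) {
		//No per-card work here; the current constructor computes every fingerprint.
		this._idfMap = memoizedIDFMapForCards(cards);
	}

	//Only the cards actually asked about get a fingerprint computed.
	fingerprintForCard(card : Card) : Fingerprint {
		return generateFingerprint(card, this._idfMap, this._size, this._fields);
	}
}
```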