Automatically suggest cards that appear to be related

jkomoros commented 4 years ago

I swear this was captured in another issue before.

Look for cards that seem to have overlapping concepts. E.g. tags, or where their rarest words overlap to some degree.

Right now when adding cards and stubs, you have to keep in your head all of the cards you just added so you can interlink them so that there's a web of TODOs to recover. But if the system could help you find those related cards then that would make it a lot easier.

Conceptually, each card would be indexed, with each partial word counted. Then, given a card, you'd collect all of the other cards whose index entries overlap to some degree with the words in that card. You'd then rank them by weighting the prevalence of the words they share with the target card against how common those words are in general.

This would be a very expensive operation, so you'd want to run it only on demand.

Now that partial matches show up in the find dialog, this feels similar to that... just more.
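
A rough sketch of the index-and-rank idea above, in TypeScript. Everything here is hypothetical: the `Card` shape and the `tokenize`, `buildIndex`, and `relatedCards` names are made up for illustration, and it indexes whole words rather than partial words to keep it short. Shared words are weighted by log(total cards / cards containing the word), so rare words count for more.

```typescript
// Hypothetical card shape; the real data model is not shown in this thread.
interface Card {
  id: string;
  body: string;
}

// Lowercase the text and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Build an index of word -> IDs of the cards that contain that word.
function buildIndex(cards: Card[]): Record<string, string[]> {
  const index: Record<string, string[]> = Object.create(null);
  for (const card of cards) {
    const seen: Record<string, boolean> = Object.create(null);
    for (const word of tokenize(card.body)) {
      if (seen[word]) continue; // count each word once per card
      seen[word] = true;
      if (!index[word]) index[word] = [];
      index[word].push(card.id);
    }
  }
  return index;
}

// Rank the other cards by the summed rarity of the words they share with the target.
function relatedCards(target: Card, cards: Card[]): string[] {
  const index = buildIndex(cards);
  const scores: Record<string, number> = Object.create(null);
  const seen: Record<string, boolean> = Object.create(null);
  for (const word of tokenize(target.body)) {
    if (seen[word]) continue;
    seen[word] = true;
    const ids = index[word] || [];
    if (ids.length === 0) continue;
    // Words that appear in fewer cards get a larger weight.
    const weight = Math.log(cards.length / ids.length);
    for (const id of ids) {
      if (id === target.id) continue;
      scores[id] = (scores[id] || 0) + weight;
    }
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}
```

Rebuilding the index on every call is the expensive part, which is why you'd only want to run something like this on demand (or cache the index).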

jkomoros commented 4 years ago

Another related feature: for unlinked text in a card, look for the cards that most overlap with the phrase. You'd process in a sliding window of n-grams of different sizes.

This calculation could be called 'semantic overlap' between two cards. The overlap of the 'important' words in the tokenized text strings.
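
As an illustration of the sliding-window idea, here is a minimal TypeScript sketch. The names (`CardText`, `LinkSuggestion`, `splitWords`, `ngrams`, `suggestLinks`) and the `minOverlap` threshold are all assumptions for the example. It also treats every word as equally important, where a real version would presumably weight the 'important' words more.

```typescript
// Hypothetical minimal card shape for this sketch.
interface CardText {
  id: string;
  text: string;
}

interface LinkSuggestion {
  phrase: string;
  cardId: string;
  overlap: number;
}

// Lowercase the text and split on anything that is not a letter or digit.
function splitWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Slide a window of each requested size over the tokens.
function ngrams(tokens: string[], sizes: number[]): string[][] {
  const result: string[][] = [];
  for (const size of sizes) {
    for (let start = 0; start + size <= tokens.length; start++) {
      result.push(tokens.slice(start, start + size));
    }
  }
  return result;
}

// For each n-gram, keep the card sharing the largest fraction of its words,
// provided that fraction reaches minOverlap. Ties go to the earlier card.
function suggestLinks(
  text: string,
  cards: CardText[],
  sizes: number[],
  minOverlap: number
): LinkSuggestion[] {
  const cardWords = cards.map((card) => splitWords(card.text));
  const suggestions: LinkSuggestion[] = [];
  for (const gram of ngrams(splitWords(text), sizes)) {
    let bestIndex = -1;
    let bestOverlap = 0;
    for (let i = 0; i < cardWords.length; i++) {
      const hits = gram.filter((w) => cardWords[i].indexOf(w) >= 0).length;
      const overlap = hits / gram.length;
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        bestIndex = i;
      }
    }
    if (bestIndex >= 0 && bestOverlap >= minOverlap) {
      suggestions.push({
        phrase: gram.join(" "),
        cardId: cards[bestIndex].id,
        overlap: bestOverlap,
      });
    }
  }
  return suggestions;
}
```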

jkomoros commented 4 years ago

Cards should keep track of the words that other cards use to link to them, both to hint at how their title might be worded differently and to serve as search terms for other cards to search over.

jkomoros commented 4 years ago

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

jkomoros commented 4 years ago

[x] create a selectIDFMap that is a reselector of an IDF map based on all normalized search string properties of all cards in the set.
[x] Scale the query matching value by IDF automatically for normal queries
[x] create a selectSemanticFingerprintMap which is a map of cardId -> map(normalized-term) -> tf-idf, filtering to only have the top N (25? 50?) highest terms, with the key order in decreasing order of tf-idf (allowing short-circuiting of other overlap calculations)
[x] Create a semanticOverlap(card, other) which returns the sum of the diff of all of the terms of the semantic fingerprint for each card. (or maybe sum of squares of diff, similar to Euclidean distance)
[x] word normalizing doesn't remove trailing ")" (or presumably leading "(") when normalizing words (and also quotes)
[x] Make the selectors that can be, be non-exported
[x] Make sure that if we didn't exclude links and inbound_links then the overlapping cards would be better suggestions
[x] Multiply distance by how much UNDER the fingerprint size the fingerprint is, so short cards are penalized (but cards with full fingerprints that just don't overlap much are still penalized)
[x] Increase size of fingerprints to be larger (and maybe include the distance between overlapping keys in terms of sorted order, with higher being worse?)
[ ] Some way to hide suggested cards that are much worse? As in, when there's a drop off from OK to bad suggestions, stop showing. As in, don't always show 5 suggested cards.
[ ] Score goodness of overlap algorithm by seeing how many of links/inbound links are predicted
[ ] Copy title, inbound links a few times more than body since they're more important
[x] Deemphasize number of times a word happens (seems to override the IDF)
[x] Figure out why blank cards are ranking so high in similar cards
[x] Figure out why certain cards (e.g. you will miss insights from people you outrank and the developers will find a way) keep ranking so high
[x] verify that the expensive IDF calculations aren't done for users who may not edit
[x] figure out a way to do stemming / partial overlap in fingerprints. https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[x] Create a semanticallyOverlappingCards(card, n) which returns the set of the n most overlapping cards with the given card
[x] Add an overlappingCards section to card info panel that, once a card is loaded, fetches the overlapping cards. It should filter out cards already in links and inbound links. Only select that information if the user has edit privileges, since it will be expensive to generate that index.
[x] make sure if the info panel isn't showing (e.g. in mobile mode) it doesn't select the expensive things
[x] tags should also get a semantic fingerprint (sum up all fingerprints for cards in the tag, then resort, then trim down to fingerprint size), and that should be used to suggest tags for cards.
[x] Clicking a suggested tag should add it, not navigate
[ ] We should not show suggested tags if the remaining ones are below some threshold
[x] suggested tags should update when the editing card has a tag added or removed
[x] Suggested tags should be based on the actual content of the editing card, not on the card as saved in the db (otherwise when you first start working on a card you won't get good suggestions until it's been saved)
[x] Unexport semantic selectors that can be private
[x] Overlap of higher keys in the fingerprint should count more. Even just Log(len(fingerprintOne) - indexOfMatchInOne + len(fingerprintTwo) - indexOfMatchInTwo). Ideally would still have the outcome be between 0 and 1, but that's not extremely important (just scale by the maximum possible score if every single tag matched in order)
[ ] Allow a show more / show fewer option for similar cards in the UI
[x] Similar cards should update while you're editing, too, to make it easier to find cards to link while you're editing
[x] Use the same stemming / normalizing in search queries
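
To make a few of the checklist items concrete, here is a hedged TypeScript sketch of an IDF map, a top-N tf-idf fingerprint, and the position-weighted overlap from the Log(...) item. This is not the code that landed. The names `WeightedTerm`, `idfMap`, and `fingerprint` are hypothetical, the fingerprint is an array sorted by decreasing weight rather than an ordered map, and the `1 + log(count)` damping is only one assumed way to deemphasize repeated words. The overlap is scaled by the in-order maximum for a full-size fingerprint, so two identical full-size fingerprints score 1 and short fingerprints score lower.

```typescript
// One fingerprint entry; fingerprints are kept sorted by decreasing weight.
interface WeightedTerm {
  term: string;
  weight: number;
}

// IDF over already-tokenized cards: log(N / number of cards containing the term).
function idfMap(corpus: string[][]): Record<string, number> {
  const docCounts: Record<string, number> = Object.create(null);
  for (const doc of corpus) {
    const seen: Record<string, boolean> = Object.create(null);
    for (const term of doc) {
      if (seen[term]) continue;
      seen[term] = true;
      docCounts[term] = (docCounts[term] || 0) + 1;
    }
  }
  const idf: Record<string, number> = Object.create(null);
  for (const term of Object.keys(docCounts)) {
    idf[term] = Math.log(corpus.length / docCounts[term]);
  }
  return idf;
}

// The `size` highest tf-idf terms of one card, in decreasing order of weight.
function fingerprint(doc: string[], idf: Record<string, number>, size: number): WeightedTerm[] {
  const counts: Record<string, number> = Object.create(null);
  for (const term of doc) counts[term] = (counts[term] || 0) + 1;
  const entries: WeightedTerm[] = Object.keys(counts).map((term) => ({
    term: term,
    // Assumed damping: repeated words grow the weight only logarithmically.
    weight: (1 + Math.log(counts[term])) * (idf[term] || 0),
  }));
  entries.sort((a, b) => b.weight - a.weight);
  return entries.slice(0, size);
}

// Matches nearer the top of either fingerprint contribute more to the score.
function semanticOverlap(one: WeightedTerm[], two: WeightedTerm[], size: number): number {
  const indexInTwo: Record<string, number> = Object.create(null);
  two.forEach((entry, i) => { indexInTwo[entry.term] = i; });
  let score = 0;
  one.forEach((entry, i) => {
    const j = indexInTwo[entry.term];
    if (j === undefined) return; // term does not appear in the other fingerprint
    score += Math.log(one.length - i + two.length - j);
  });
  // Score of two full-size fingerprints whose terms all match in order.
  let best = 0;
  for (let i = 0; i < size; i++) best += Math.log(2 * (size - i));
  return best > 0 ? score / best : 0;
}
```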

jkomoros commented 4 years ago

Why are certain cards so common?

Very few cards actually share terms. (Even with no constraint on fingerprint length, there's only 11% overlap of words.) So I think the commonly suggested cards are basically cards that rarely contain rare terms. That means that most of the fingerprint distance is just the distance of each term from 0.

Maybe take the difference of the set of keys that overlap across cards, and scale by how big that overlap set is?
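
One possible reading of that idea, sketched in TypeScript. The `Fingerprint` type and `sharedTermDistance` name are hypothetical, and the scaling is an assumption: the mean difference over the shared terms is divided by the fraction of terms that are shared, so a bigger overlap set means a smaller distance. Cards with no shared terms get an infinite distance instead of a distance dominated by each term's distance from 0.

```typescript
// Hypothetical fingerprint shape: normalized term -> tf-idf weight.
type Fingerprint = Record<string, number>;

// Distance computed over shared terms only, scaled by how large the overlap is.
function sharedTermDistance(a: Fingerprint, b: Fingerprint): number {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  const shared = keysA.filter((term) => Object.prototype.hasOwnProperty.call(b, term));
  if (shared.length === 0) return Infinity; // nothing to compare

  let total = 0;
  for (const term of shared) total += Math.abs(a[term] - b[term]);
  const meanDiff = total / shared.length;

  // Assumed scaling: share of the smaller fingerprint that overlaps.
  const overlapFraction = shared.length / Math.min(keysA.length, keysB.length);
  return meanDiff / overlapFraction;
}
```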

jkomoros / card-web

Automatically suggest cards that appear to be related #280