jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

Automatically suggest cards that appear to be related #280

Open jkomoros opened 4 years ago

jkomoros commented 4 years ago

I swear this was captured in another issue before.

Look for cards that seem to have overlapping concepts. E.g. tags, or where their rarest words overlap to some degree.

Right now when adding cards and stubs, you have to keep in your head all of the cards you just added so you can interlink them so that there's a web of TODOs to recover. But if the system could help you find those related cards then that would make it a lot easier.

Conceptually, each card would be indexed, with each partial word counted. Then, given a card, you'd collect all of the other cards that overlap to some degree in the index entries for the words in that card. Finally you'd rank them by scoring the prevalence of the words that overlap with the target card against how common those words are in general.
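A rough sketch of that index-then-rank idea, in Python for brevity rather than in the app's own code. Everything here is an assumption for illustration: the `tokenize`, `build_index` and `related_cards` names, indexing whole words instead of partial words, and using a log inverse-document-frequency as the "commonality" weight.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(cards):
    """Map each word to the set of card ids whose text contains it."""
    index = defaultdict(set)
    for card_id, text in cards.items():
        for word in set(tokenize(text)):
            index[word].add(card_id)
    return index


def related_cards(card_id, cards, index):
    """Rank other cards by the summed rarity of the words they share with card_id."""
    total = len(cards)
    scores = Counter()
    for word in set(tokenize(cards[card_id])):
        holders = index[word]
        # Rare words get a high weight; a word on every card gets 0.
        idf = math.log(total / len(holders))
        for other in holders:
            if other != card_id:
                scores[other] += idf
    return [(other, score) for other, score in scores.most_common() if score > 0]


# Hypothetical cards, keyed by id.
cards = {
    "a": "emergence in complex adaptive systems",
    "b": "complex adaptive systems resist top-down control",
    "c": "gardening tips for spring",
    "d": "control systems in engineering",
}
index = build_index(cards)
ranked = related_cards("a", cards, index)
```

Here `ranked` puts card `b` ahead of card `d`, because `b` shares the rarer words "complex" and "adaptive" with `a`, and leaves out `c`, which shares nothing. Building the index once and only scoring on demand keeps the expensive part out of the normal editing path.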

This would be a very expensive operation, so you'd want to run it only on demand.

Now that partial matches show up in the find dialog, this feels similar to that... just more.

jkomoros commented 4 years ago

Another related feature: for unlinked text in a card, look for the cards that most overlap with the phrase. You'd process in a sliding window of n-grams of different sizes.
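One way the sliding window could look, as a minimal sketch. The `suggest_links` name, the window sizes of 1 to 3 words, and the Jaccard score between the phrase and each card's words are all assumptions, not anything the app is known to do.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_links(text, cards, sizes=(1, 2, 3)):
    """Score every n-gram of text against every card; best matches first.

    Returns (score, phrase, card_id) tuples, where score is the Jaccard
    overlap between the phrase's words and the card's words.
    """
    card_words = {card_id: set(tokenize(body)) for card_id, body in cards.items()}
    words = tokenize(text)
    suggestions = []
    for n in sizes:
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            phrase = words[start:start + n]
            phrase_set = set(phrase)
            for card_id, body_words in card_words.items():
                union = phrase_set | body_words
                if not union:
                    continue
                score = len(phrase_set & body_words) / len(union)
                if score > 0:
                    suggestions.append((score, " ".join(phrase), card_id))
    suggestions.sort(key=lambda item: item[0], reverse=True)
    return suggestions


# Hypothetical cards and unlinked text.
cards = {
    "adjacent-possible": "adjacent possible",
    "gardening": "gardening is a slow practice",
}
top = suggest_links("Ideas grow in the adjacent possible over time", cards)[0]
```

With this toy data the best suggestion is the two-word phrase "adjacent possible" pointing at the `adjacent-possible` card. A plain Jaccard score favors short cards, so a real version would probably want to weight 'important' words instead of treating every word equally.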

This calculation could be called 'semantic overlap' between two cards. The overlap of the 'important' words in the tokenized text strings.

jkomoros commented 4 years ago

Cards should keep track of the words that other cards use to link to them. That would show where a card's title might be worded differently, and those words could also serve as search terms for other cards to search over.
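A small sketch of what tracking that link text might look like. The `(source_id, target_id, link_text)` tuple shape and the `inbound_link_text` name are hypothetical; the real card data model may store links quite differently.

```python
from collections import Counter, defaultdict


def inbound_link_text(links):
    """Count, per target card, the phrases other cards use to link to it."""
    inbound = defaultdict(Counter)
    for source_id, target_id, link_text in links:
        # Normalize case and whitespace so trivially different phrasings merge.
        inbound[target_id][link_text.strip().lower()] += 1
    return inbound


# Hypothetical links: (source card, target card, text of the link).
links = [
    ("a", "c", "emergent behavior"),
    ("b", "c", "Emergent behavior"),
    ("d", "c", "emergence"),
]
inbound = inbound_link_text(links)
most_common_phrase = inbound["c"].most_common(1)
```

In this example card `c` is most often linked as "emergent behavior", which might be a candidate for an alternate title or an extra search term.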

jkomoros commented 4 years ago

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

jkomoros commented 4 years ago

Why are certain cards so common?

Very few cards actually share terms. (Even with no constraint on fingerprint length, there's only 11% overlap of words.) So I think the common cards are basically cards that rarely contain rare terms. That means that most of the fingerprint distance is just the distance of each term from 0.

Maybe take the difference of the set of keys that overlap across cards, and scale by how big that overlap set is?
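One possible reading of that idea, sketched under heavy assumptions: a fingerprint is taken to be a dict of term to weight, the difference is measured only over the keys both cards share, and the result is scaled by the fraction of keys that overlap. The `overlap_distance` name and the exact scaling are guesses, not the implemented behavior.

```python
import math


def overlap_distance(fingerprint_a, fingerprint_b):
    """Distance over shared terms only, scaled by how much of the key set overlaps.

    Lower is closer. Cards with no shared terms are infinitely far apart,
    so terms missing from one card no longer count as distance from 0.
    """
    shared = fingerprint_a.keys() & fingerprint_b.keys()
    if not shared:
        return math.inf
    union = fingerprint_a.keys() | fingerprint_b.keys()
    mean_diff = sum(abs(fingerprint_a[k] - fingerprint_b[k]) for k in shared) / len(shared)
    overlap_fraction = len(shared) / len(union)
    # A bigger overlap set shrinks the distance; a tiny one inflates it.
    return mean_diff / overlap_fraction


# Hypothetical fingerprints: term -> weight.
a = {"emergence": 0.9, "systems": 0.4, "complex": 0.5}
b = {"systems": 0.3, "complex": 0.5, "control": 0.8}
c = {"gardening": 1.0}
```

With these numbers `overlap_distance(a, b)` comes out at about 0.1, while `overlap_distance(a, c)` is infinite. Whether dividing by the overlap fraction is the right scaling is an open question; it is just one way to stop cards without rare terms from looking close to everything.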