jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

A mechanism to suggest missing concepts #417

Open jkomoros opened 3 years ago

jkomoros commented 3 years ago

Originally tracked in #399, but because it's such a rich mechanism it makes sense to track it separately.

jkomoros commented 3 years ago

Making prettyFingerprintItems work with ngrams, including stop words, will require radically changing how it works. Today it goes through all of the words in cardObj (which might be a collection of cards), counting every stemmed word along with its non-stemmed variants, and then it processes ngrams word by word. The results don't include stop words, and each word might get the wrong destemmed variant, producing an expanded ngram that is nonsense, especially when operating over very large collections of cards.
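For context, here's a minimal sketch of why that word-by-word destemming breaks down over large collections. The names (countVariants, destemNgram) are illustrative, not the actual card-web code:

```ts
type StemToVariants = Map<string, Map<string, number>>;

//Tally every non-stemmed variant seen for each stem, across the whole cardObj.
const countVariants = (normalizedWords : string[], stemmedWords : string[]) : StemToVariants => {
	const counts : StemToVariants = new Map();
	for (let i = 0; i < stemmedWords.length; i++) {
		const variants = counts.get(stemmedWords[i]) || new Map<string, number>();
		variants.set(normalizedWords[i], (variants.get(normalizedWords[i]) || 0) + 1);
		counts.set(stemmedWords[i], variants);
	}
	return counts;
};

//Destem an ngram word by word, independently picking the globally most common
//variant for each stem. Over a large collection the winning variant for one
//word may come from an unrelated card, producing a nonsense expanded ngram.
const destemNgram = (stemmedNgram : string, counts : StemToVariants) : string => {
	return stemmedNgram.split(' ').map(stem => {
		const variants = counts.get(stem);
		if (!variants) return stem;
		let best = stem;
		let bestCount = 0;
		for (const [variant, count] of variants.entries()) {
			if (count > bestCount) {
				best = variant;
				bestCount = count;
			}
		}
		return best;
	}).join(' ');
};
```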

In all cases, we have to assume that the fingerprint was generated based on this cardObj/collection, so every ngram in it has an analogue in the card/collection.

Instead of doing it for every ngram in the collection, do it only based on items in the fingerprint:

1. For each item in the fingerprint, look through the normalized/stemmed/destopped text runs for that card, looking for a literal wordBoundaryOverlap. If it's found in a run, activate expanded ngram extraction for that run.
2. In the normalized-stemmed variant of the run, iterate word by word. When we find the first word of the fingerprint ngram, start matching. Then consume either any number of stop words (accumulating them) or the next target word in the sequence; if the next word is neither the next target word nor a stop word, abandon the partial match and continue scanning.
3. A full match gives us a stemmed, stop-worded string, and we know which word offset it started at. Look into the normalized variant of that run at the same offset/length; that's the extracted, normalized example of the ngram for that run.
4. Collect those extractions, use the most common one, and title case it.
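A sketch of that extraction loop, under assumptions: the run shape, STOP_WORDS, and the helper names are illustrative, and the wordBoundaryOverlap pre-check is assumed to have already selected the run:

```ts
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and', 'in', 'to']);

interface ProcessedRun {
	//normalized and stemmed are index-aligned word for word, because no run
	//(and no word within a run) is filtered out before the withoutStops stage.
	normalized : string[];
	stemmed : string[];
}

//Find every expanded, normalized occurrence of a stemmed, stop-word-free
//fingerprint ngram within one run.
const extractExpandedNgrams = (fingerprintNgram : string, run : ProcessedRun) : string[] => {
	const targets = fingerprintNgram.split(' ');
	const results : string[] = [];
	for (let start = 0; start < run.stemmed.length; start++) {
		if (run.stemmed[start] !== targets[0]) continue;
		let targetIndex = 1;
		let end = start + 1;
		while (targetIndex < targets.length && end < run.stemmed.length) {
			const word = run.stemmed[end];
			if (word === targets[targetIndex]) {
				//Consume the next target word in the sequence.
				targetIndex++;
				end++;
			} else if (STOP_WORDS.has(word)) {
				//Consume an interior stop word, accumulating it into the span.
				end++;
			} else {
				//Neither the next target word nor a stop word: abandon this
				//partial match and keep scanning.
				break;
			}
		}
		if (targetIndex === targets.length) {
			//The same offset/length in the normalized stage is the extracted,
			//normalized example of this ngram.
			results.push(run.normalized.slice(start, end).join(' '));
		}
	}
	return results;
};

//Across all matching runs, use the most common extraction and title case it.
const prettiestExtraction = (extractions : string[]) : string => {
	const counts = new Map<string, number>();
	for (const item of extractions) counts.set(item, (counts.get(item) || 0) + 1);
	let best = '';
	let bestCount = 0;
	for (const [item, count] of counts.entries()) {
		if (count > bestCount) {
			best = item;
			bestCount = count;
		}
	}
	return best.split(' ').map(word => word.charAt(0).toUpperCase() + word.slice(1)).join(' ');
};
```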

A few things need to happen for this approach:

1. For each card, we need to keep three stages of nlp processing of strings: a) normalized strings, b) stemmed, c) withoutStops. Each phase is based on the previous one, with one extra processing step. This is actually nice because today the pipeline sometimes uses the withoutStops phase and sometimes wants just stemmed, and this way we'd have both, set via cardSetNormalizedTextProperties.
2. No run is ever filtered out, so the indexes in each of the three stages of processing all match up.
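A sketch of what those three aligned stages could look like per run. The field and helper names here are assumed for illustration, not the real cardSetNormalizedTextProperties output:

```ts
//Placeholder stand-ins for the real normalizer and stemmer, which live
//elsewhere in the pipeline.
const normalizeText = (s : string) : string =>
	s.toLowerCase().replace(/[^a-z0-9 ]/g, ' ').replace(/\s+/g, ' ').trim();
const stemWord = (word : string) : string => word; //stand-in for the real stemmer
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and']);

interface NlpRun {
	normalized : string;   //stage a: the normalized text of the run
	stemmed : string;      //stage b: stage a with every word stemmed
	withoutStops : string; //stage c: stage b with stop words removed
}

//Each stage derives from the previous one. Crucially, a run is kept even if it
//ends up empty after stop-word removal, so run i in one stage is run i in all
//three, and word offsets within normalized and stemmed line up one to one.
const processRun = (rawRun : string) : NlpRun => {
	const normalized = normalizeText(rawRun);
	const stemmed = normalized.split(' ').map(stemWord).join(' ');
	const withoutStops = stemmed.split(' ').filter(word => !STOP_WORDS.has(word)).join(' ');
	return {normalized, stemmed, withoutStops};
};
```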