jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

A mechanism to suggest missing concepts #417

Open jkomoros opened 3 years ago

jkomoros commented 3 years ago

Originally tracked in #399, but because it's such a rich mechanism it makes sense to track it separately.

jkomoros commented 3 years ago

Making prettyFingerprintItems work with ngrams, including stop words, will require radically changing how it works. Today it goes through all of the words in cardObj (which might be a collection of cards), counting every stemmed word along with its non-stemmed variants, and then it processes ngrams word by word. The results don't include stop words, and each word might get the wrong destemmed variant, producing an expanded ngram that is nonsense, especially when operating over very large collections of cards.
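For context, here's a minimal sketch of why that word-by-word destemming breaks down over large collections. The names (countVariants, destemNgram) are illustrative, not the actual card-web code:

```ts
type StemToVariants = Map<string, Map<string, number>>;

//Tally every non-stemmed variant seen for each stem, across the whole cardObj.
const countVariants = (normalizedWords : string[], stemmedWords : string[]) : StemToVariants => {
	const counts : StemToVariants = new Map();
	for (let i = 0; i < stemmedWords.length; i++) {
		const variants = counts.get(stemmedWords[i]) || new Map<string, number>();
		variants.set(normalizedWords[i], (variants.get(normalizedWords[i]) || 0) + 1);
		counts.set(stemmedWords[i], variants);
	}
	return counts;
};

//Destem an ngram word by word, independently picking the globally most common
//variant for each stem. Over a large collection the winning variant for one
//word may come from an unrelated card, producing a nonsense expanded ngram.
const destemNgram = (stemmedNgram : string, counts : StemToVariants) : string => {
	return stemmedNgram.split(' ').map(stem => {
		const variants = counts.get(stem);
		if (!variants) return stem;
		let best = stem;
		let bestCount = 0;
		for (const [variant, count] of variants.entries()) {
			if (count > bestCount) {
				best = variant;
				bestCount = count;
			}
		}
		return best;
	}).join(' ');
};
```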

In all cases, we have to assume that the fingerprint was generated based on this cardObj/collection, so every ngram in it has an analogue in the card/collection.

Instead of doing it for every ngram in the collection, do it only based on items in the fingerprint:

1. For each item in the fingerprint, look through the normalized/stemmed/destopped text runs for that card, looking for a literal wordBoundaryOverlap. If it's found in a run, activate expanded ngram extraction for that run.
2. In the normalized-stemmed variant of the run, iterate word by word. When we find the first word of the fingerprint ngram, start matching. Then consume either any number of stop words (accumulating them) or the next target word in the sequence; if the next word is neither the next target word nor a stop word, abandon the partial match and continue scanning.
3. A full match gives us a stemmed, stop-worded string, and we know which word offset it started at. Look into the normalized variant of that run at the same offset/length; that's the extracted, normalized example of the ngram for that run.
4. Collect those extractions, use the most common one, and title case it.
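A sketch of that extraction loop, under assumptions: the run shape, STOP_WORDS, and the helper names are illustrative, and the wordBoundaryOverlap pre-check is assumed to have already selected the run:

```ts
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and', 'in', 'to']);

interface ProcessedRun {
	//normalized and stemmed are index-aligned word for word, because no run
	//(and no word within a run) is filtered out before the withoutStops stage.
	normalized : string[];
	stemmed : string[];
}

//Find every expanded, normalized occurrence of a stemmed, stop-word-free
//fingerprint ngram within one run.
const extractExpandedNgrams = (fingerprintNgram : string, run : ProcessedRun) : string[] => {
	const targets = fingerprintNgram.split(' ');
	const results : string[] = [];
	for (let start = 0; start < run.stemmed.length; start++) {
		if (run.stemmed[start] !== targets[0]) continue;
		let targetIndex = 1;
		let end = start + 1;
		while (targetIndex < targets.length && end < run.stemmed.length) {
			const word = run.stemmed[end];
			if (word === targets[targetIndex]) {
				//Consume the next target word in the sequence.
				targetIndex++;
				end++;
			} else if (STOP_WORDS.has(word)) {
				//Consume an interior stop word, accumulating it into the span.
				end++;
			} else {
				//Neither the next target word nor a stop word: abandon this
				//partial match and keep scanning.
				break;
			}
		}
		if (targetIndex === targets.length) {
			//The same offset/length in the normalized stage is the extracted,
			//normalized example of this ngram.
			results.push(run.normalized.slice(start, end).join(' '));
		}
	}
	return results;
};

//Across all matching runs, use the most common extraction and title case it.
const prettiestExtraction = (extractions : string[]) : string => {
	const counts = new Map<string, number>();
	for (const item of extractions) counts.set(item, (counts.get(item) || 0) + 1);
	let best = '';
	let bestCount = 0;
	for (const [item, count] of counts.entries()) {
		if (count > bestCount) {
			best = item;
			bestCount = count;
		}
	}
	return best.split(' ').map(word => word.charAt(0).toUpperCase() + word.slice(1)).join(' ');
};
```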

A few things need to happen for this approach:

1. For each card, we need to keep three stages of nlp processing of strings: a) normalized strings, b) stemmed, c) withoutStops. Each phase is based on the previous one, with one extra processing step. This is actually nice because today the pipeline sometimes uses the withoutStops phase and sometimes wants just stemmed, and this way we'd have both, set via cardSetNormalizedTextProperties.
2. No run is ever filtered out, so the indexes in each of the three stages of processing all match up.
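A sketch of what those three aligned stages could look like per run. The field and helper names here are assumed for illustration, not the real cardSetNormalizedTextProperties output:

```ts
//Placeholder stand-ins for the real normalizer and stemmer, which live
//elsewhere in the pipeline.
const normalizeText = (s : string) : string =>
	s.toLowerCase().replace(/[^a-z0-9 ]/g, ' ').replace(/\s+/g, ' ').trim();
const stemWord = (word : string) : string => word; //stand-in for the real stemmer
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and']);

interface NlpRun {
	normalized : string;   //stage a: the normalized text of the run
	stemmed : string;      //stage b: stage a with every word stemmed
	withoutStops : string; //stage c: stage b with stop words removed
}

//Each stage derives from the previous one. Crucially, a run is kept even if it
//ends up empty after stop-word removal, so run i in one stage is run i in all
//three, and word offsets within normalized and stemmed line up one to one.
const processRun = (rawRun : string) : NlpRun => {
	const normalized = normalizeText(rawRun);
	const stemmed = normalized.split(' ').map(stemWord).join(' ');
	const withoutStops = stemmed.split(' ').filter(word => !STOP_WORDS.has(word)).join(' ');
	return {normalized, stemmed, withoutStops};
};
```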