Semantic fingerprints should include n-grams

jkomoros commented 3 years ago

The fingerprint system is used in a lot of places, especially now that #353 is done and there are a lot more working notes in the system.

Some statements are actually more like n-grams, where the fact that the words show up in a row is very significant. The overall system should treat those co-occurring n-grams as more special.
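A minimal sketch of what counting bigrams alongside single words could look like. The function names here (`ngrams`, `wordCountsWithBigrams`) are hypothetical and not taken from the codebase, and the whitespace split stands in for whatever normalization and stemming the real pipeline does.

```javascript
// Hypothetical helper: returns every n-gram of size n from an array of words.
function ngrams(words, n) {
  const result = [];
  for (let i = 0; i + n <= words.length; i++) {
    result.push(words.slice(i, i + n).join(' '));
  }
  return result;
}

// Hypothetical helper: counts unigrams and bigrams in one Map, so that words
// appearing in a row get their own entry in addition to the single words.
function wordCountsWithBigrams(text) {
  // Assumption: lowercasing plus whitespace splitting replaces real stemming.
  const words = text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
  const counts = new Map();
  for (const gram of [...ngrams(words, 1), ...ngrams(words, 2)]) {
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}
```

With this shape, a phrase like "working notes" gets a count of its own, separate from "working" and "notes", which is what would let the fingerprint treat it as more special.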

Originally tracked in #353

jkomoros commented 3 years ago

Also, do the extracted words factor out things like quotes, and treat things like '/' the same as '-' (that is, mostly like a space)? And make sure parenthetical asides are handled, too.
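A rough sketch of the normalization being asked about, assuming '/' and '-' should both act like spaces and that double quotes and parentheses are simply dropped. `normalizeForWords` is a hypothetical name, and single quotes are deliberately left alone here because they are ambiguous with apostrophes.

```javascript
// Hypothetical helper: normalizes text before word extraction.
function normalizeForWords(text) {
  return text
    .toLowerCase()
    // Assumption: '/' and '-' both behave like word separators.
    .replace(/[\/\-]/g, ' ')
    // Assumption: straight and curly double quotes and parentheses are dropped.
    .replace(/["“”()]/g, ' ')
    // Collapse the runs of whitespace the replacements leave behind.
    .replace(/\s+/g, ' ')
    .trim();
}
```

For example, `normalizeForWords('The "either/or" choice (a trade-off)')` returns `'the either or choice a trade off'`.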

jkomoros commented 3 years ago

Semantic fingerprints should also weight titles and inbound reference text more strongly than normal content text
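One way this could be sketched: give each text field a multiplier when adding its words to the counts. The field names and the specific weights below are assumptions for illustration only, not values from the codebase.

```javascript
// Assumption: illustrative field names and weights; real values would need tuning.
const FIELD_WEIGHTS = new Map([
  ['title', 2.0],
  ['referencesInbound', 1.5],
  ['body', 1.0],
]);

// Hypothetical helper: fields maps a field name to its text. Each word adds
// its field's weight to the count instead of adding 1.
function weightedWordCounts(fields) {
  const counts = new Map();
  for (const [fieldName, text] of Object.entries(fields)) {
    // Unknown fields fall back to the normal content weight.
    const weight = FIELD_WEIGHTS.get(fieldName) ?? 1.0;
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      counts.set(word, (counts.get(word) || 0) + weight);
    }
  }
  return counts;
}
```

A word that appears once in the title and once in the body would then count as 3 rather than 2.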

jkomoros commented 3 years ago

There are some distinctive but not content-ful words that stand out in fingerprints but aren't important. (Although as the corpus of cards gets larger and larger, the distinctive words will pop out less and less)

jkomoros commented 3 years ago

In general, the text processing is a bit random, ad-hoc, and hard to work on without breaking something. Everything that has to do with natural language processing, stemming, etc should be factored out into nlp.js and it should be tested.

We do some weird joining of e.g. multiple inbound links together. Ideally, within a given property there would be an array of strings where each string was matched separately, so that words at the end of one sub-string didn't match with the beginning of the next sub-string. This would also ideally be used for sentence breaks and block-level breaks (e.g. paragraph, ul, li). And then things inside of quotes or parentheses could also be a separate sub-string.

There's specialized machinery in query matching now that rewards string matches that have a dash in them if the query also did. But having the right boundaries between sub-strings, combined with processing and storing bigrams, should be sufficient.
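A sketch of the sub-string idea, assuming runs are delimited by `\n` and bigrams are only formed inside a single run. `bigramsWithinRuns` is a hypothetical name.

```javascript
// Hypothetical helper: splits text into runs on newlines and builds bigrams
// only from adjacent words in the same run, so the last word of one run never
// pairs with the first word of the next.
function bigramsWithinRuns(text) {
  const runs = text.split('\n').map(r => r.trim()).filter(r => r.length > 0);
  const bigrams = [];
  for (const run of runs) {
    const words = run.split(/\s+/);
    for (let i = 0; i + 1 < words.length; i++) {
      bigrams.push(words[i] + ' ' + words[i + 1]);
    }
  }
  return bigrams;
}
```

For two inbound link texts joined as `'working notes\nsemantic fingerprint'`, this yields `'working notes'` and `'semantic fingerprint'` but not `'notes semantic'`.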

jkomoros commented 3 years ago

element.innerText already puts in \n for block level element breaks

jkomoros commented 3 years ago

[x] Remove the special behavior that does the normalization logic multiple times, since dashes aren't maintained anyway
[x] un-break the logic that removes '-' in filternames and queries (that was broken in 01286ab3f676231e7bf4d2311e3fa743aa5ddf59)
[x] Move semantic fingerprint logic into nlp.js
[x] un-export as many things from nlp.js as possible
[x] Add bigrams to preparedQueries
[x] Make the multiple-link-text character be \n (maintenance task)
[x] Fix broken test behavior in the first test from broken innerText behavior
[x] test fingerprint
[x] test fingerprint --> expanded semantic fingerprint
[x] Test closestOverlappingItems (first add some overlap between cards)
[x] fingerprints also operate over bigrams
[ ] Try also doing trigrams (with 1/3 as match count) for fingerprints. See if the performance/quality tradeoff is worth it (should be a small addition in quality, but might be a LOT more memory)
[ ] Should word cloud coloring be an absolute scale of how much they apply, so that the first item in the word cloud isn't always fully lit?
[ ] Should fingerprints have a minimum tfidf below which the terms are cut off? That would help avoid cards with little content having way more bigrams than other cards. (Alternatively, make the max size be way larger)
[x] cards with an empty body have ' ' in their wordCounts
[x] fingerprints might need to be larger if they include bigrams, since there will be lots more distinctive, non-overlapping bigrams
[ ] remove the trailing ' for stemmed words like alex' (for alex's)
[x] Remove stop words from normalized extracted content (and queries?) Add to,and,of
[x] Run convert-multi-links-delimiter maintenance task in production
[x] The innerText extraction SHOULD be OK because the node is never inserted into the document which is when xss triggers
[x] Pop document into separate file so it can be overruled once by multiple things that need to be able to test it
[x] Is there a way to expand the stop-words automatically, maybe by extracting the most common stemmed words and skipping those? Can also just spot-check the items with lowest idf. Candidates to stop-word: that,you,it,ar,be,on,can,have,for,but
[ ] Items in a single-nested parentheses should be separated into a separate text run (test splitRuns)
[ ] Consider splitting each sentence into a run
[ ] Support quotes in queries for separate runs. (make sure filters inside quotes don't get pulled out)
[ ] Some kind of query loss visualization tool to get a sense for if it's improving or degrading
[x] How is text like "e.g." and links parsed
[x] '/' should be converted to spaces
[ ] sanity check performance (larger fingerprints, and up-to-3 ngrams in wordCountsForSemantics)
[x] innerTextForHTML should strip out newlines and then insert them in after block elements so we don't rely on the html that was saved to be well formed
[x] Split each normalized text field into an array of runs of text to check (e.g. at newline, or at individual run, like multiple individual inbound links, also within parentheses or quotes). Convert all to \n and then break into pieces on that. Only convert single quotes to newlines if at the beginning or end of a word. Quotes used as scare quotes should effectively just be ignored.
[x] Unrelated bug: if you're on recent tab, the working notes will never show up until you switch to another tab
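
For the stop-word item in the checklist above, spot-checking the lowest-idf words could be sketched roughly like this. `lowestIdfWords` is a hypothetical name, and the idf formula (log10 of card count over document frequency) is an assumption that may differ from what the fingerprint code computes.

```javascript
// Hypothetical helper: docs is an array of word arrays, one per card.
// Returns the `count` words with the lowest idf, i.e. the words that appear
// in the largest share of cards and are therefore stop-word candidates.
function lowestIdfWords(docs, count) {
  const docFrequency = new Map();
  for (const words of docs) {
    // Use a Set so each word counts at most once per card.
    for (const word of new Set(words)) {
      docFrequency.set(word, (docFrequency.get(word) || 0) + 1);
    }
  }
  const scored = [...docFrequency].map(([word, df]) => ({
    word,
    // Assumption: plain log-scaled inverse document frequency.
    idf: Math.log10(docs.length / df),
  }));
  // Lowest idf first; ties are broken alphabetically for a stable order.
  scored.sort((a, b) => a.idf - b.idf || a.word.localeCompare(b.word));
  return scored.slice(0, count).map(s => s.word);
}
```

The output is only a list of candidates. A word that shows up in nearly every card might still carry meaning for this corpus, so the list would need a manual pass before anything is added to the stop words.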

jkomoros / card-web

Semantic fingerprints should include n-grams #365