Open jkomoros opened 3 years ago
Also, do the extracted words factor out things like quotes, and treat things like '/' the same as '-' (that is, mostly like a space)? Make sure parenthetical asides are handled, too.
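As a rough illustration of the normalization being asked about, here is a minimal sketch. `normalizeSeparators` is a hypothetical name, not an existing function in the codebase, and the exact set of characters to treat as separators is an assumption.

```javascript
// Hypothetical helper (not part of the existing code): treats '/' and '-'
// like a space and strips double quotes before words are extracted.
function normalizeSeparators(text) {
  return text
    // Slashes and dashes become spaces, so 'either/or' and 'either-or' match.
    .replace(/[\/-]/g, ' ')
    // Straight and curly double quotes are removed entirely.
    .replace(/["\u201C\u201D]/g, '')
    // Collapse any runs of whitespace the steps above created.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Parenthetical asides are deliberately left alone here, since they are probably better handled as their own sub-strings than stripped.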
Semantic fingerprints should also weight titles and inbound reference text more strongly than normal content text
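One possible shape for that weighting, as a sketch only: the field names and the weight values below are assumptions for illustration and would need tuning against real cards.

```javascript
// Hypothetical weights: titles count most, inbound reference text next,
// normal content text least. These numbers are placeholders.
const FIELD_WEIGHTS = { title: 3, inboundReferences: 2, body: 1 };

// fields maps a field name to an array of already-normalized words.
// Returns a Map from word to its weighted count across all fields.
function weightedWordCounts(fields) {
  const counts = new Map();
  for (const [field, words] of Object.entries(fields)) {
    // Unknown fields fall back to the weight of normal content text.
    const weight = FIELD_WEIGHTS[field] || 1;
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + weight);
    }
  }
  return counts;
}
```

The weighted counts could then feed whatever scoring the fingerprint already uses in place of raw counts.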
There are some distinctive but not content-ful words that stand out in fingerprints but aren't important. (Although as the corpus of cards gets larger and larger, the distinctive words will pop out less and less)
In general, the text processing is a bit random, ad hoc, and hard to work on without breaking something. Everything that has to do with natural language processing, stemming, etc. should be factored out into nlp.js, and it should be tested.
We do some weird joining of e.g. multiple inbound links. Ideally, within a given property, there would be an array of strings where each string is matched separately, so that words at the end of one sub-string don't match with the beginning of the next sub-string. This would also ideally be used for sentence breaks and block-level breaks (e.g. paragraph, ul, li). Things inside quotes or parentheses could then also be separate sub-strings.
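A minimal sketch of that splitting, assuming block-level breaks have already been turned into newlines. `splitIntoRuns` is a hypothetical name, and the sentence-break rule is a deliberately naive assumption (it would split on abbreviations like "e.g. ").

```javascript
// Hypothetical helper: splits text into sub-strings ("runs") that should be
// matched separately, so words never match across a boundary.
function splitIntoRuns(text) {
  return text
    // Sentence-ending punctuation followed by whitespace becomes a break.
    .replace(/[.!?]+\s+/g, '\n')
    // Parentheses and double quotes delimit their own runs.
    .replace(/[()"\u201C\u201D]/g, '\n')
    .split('\n')
    .map(run => run.trim())
    // Drop empty runs left behind by adjacent boundaries.
    .filter(run => run.length > 0);
}
```

For example, `splitIntoRuns('Second one (an aside) ends here')` yields three runs, with the aside isolated in the middle.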
There's specialized machinery in query matching now that rewards string matches that have a dash in them if the query also did. But having the right boundaries between sub-strings, combined with processing and storing bigrams, should be sufficient.
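To make the bigram idea concrete, here is a sketch that builds bigrams only within each sub-string, never across a boundary. `bigramsForRuns` is a hypothetical name, and it assumes the runs have already been normalized and stemmed.

```javascript
// Hypothetical helper: returns the bigrams for an array of sub-strings.
// Bigrams are built per run, so the last word of one run is never paired
// with the first word of the next.
function bigramsForRuns(runs) {
  const result = [];
  for (const run of runs) {
    const words = run.split(/\s+/).filter(word => word.length > 0);
    for (let i = 0; i < words.length - 1; i++) {
      result.push(words[i] + ' ' + words[i + 1]);
    }
  }
  return result;
}
```

If dashes are normalized to spaces first, a dashed query term and a dashed card term would produce the same bigram, which may be enough to retire the special-case dash handling.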
element.innerText already puts in \n for block level element breaks
\n
(maintenance task)
' for stemmed words like alex' (for alex's)
to,and,of
that,you,it,ar,be,on,can,have,for,but
\n and then break into pieces on that. (Only convert single quotes to newlines if at the beginning or end of a word... and what about quotes used as scare quotes? Those should effectively just be ignored.)
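The single-quote rule could look roughly like the sketch below. `breakOnBoundaryQuotes` is a hypothetical name. It does not distinguish scare quotes from real quotations, and a trailing possessive such as users' would also be treated as a closing quote, so both cases remain open.

```javascript
// Hypothetical helper: converts single quotes to newlines only when they sit
// at the beginning or end of a word, then breaks into pieces on newlines.
// Apostrophes inside a word (as in alex's) are left untouched.
function breakOnBoundaryQuotes(text) {
  return text
    // Quote at the start of a word: preceded by start of text or whitespace.
    .replace(/(^|\s)'(?=\S)/g, '$1\n')
    // Quote at the end of a word: followed by whitespace or end of text.
    .replace(/(\S)'(?=\s|$)/g, '$1\n')
    .split('\n')
    .map(piece => piece.trim())
    .filter(piece => piece.length > 0);
}
```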
The fingerprint system is used in a lot of places, especially now that #353 is done and there are a lot more working notes in the system.
Some statements are actually more like n-grams, where the fact that the words show up in a row is very significant. The overall system should treat n-grams that co-occur as more special.
Originally tracked in #353