jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0

Semantic fingerprints should include n-grams #365

Open jkomoros opened 3 years ago

jkomoros commented 3 years ago

The fingerprint system is used in a lot of places, especially now that #353 is done and there are a lot more working notes in the system.

Some statements are actually more like n-grams, where the fact that the words show up in a row is very significant. The overall system should treat n-grams that co-occur as more special.
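As a rough illustration of what counting n-grams alongside single words could look like, here is a minimal sketch. The helper names `ngrams` and `countTerms` are hypothetical and are not taken from the card-web codebase.

```javascript
// Hypothetical helper: returns every run of n consecutive words,
// joined with a single space.
function ngrams(words, n) {
  const result = [];
  for (let i = 0; i + n <= words.length; i++) {
    result.push(words.slice(i, i + n).join(' '));
  }
  return result;
}

// Hypothetical helper: counts unigrams up to maxN-grams in one map, so a
// phrase that repeats in order shows up as its own term.
function countTerms(words, maxN = 2) {
  const counts = new Map();
  for (let n = 1; n <= maxN; n++) {
    for (const gram of ngrams(words, n)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return counts;
}

const counts = countTerms(['adjacent', 'possible', 'is', 'adjacent', 'possible']);
console.log(counts.get('adjacent possible')); // 2
```

A fingerprint built from a map like this could then give a bigram such as "adjacent possible" extra weight over its individual words, though how much extra is a tuning question this sketch does not answer.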

Originally tracked in #353

jkomoros commented 3 years ago

Also, do the extracted words factor out things like quotes, and treat things like '/' the same as '-' (that is, mostly like a space)? And make sure parenthetical asides are handled, too.
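One possible shape for that normalization is sketched below. The function name `extractWords` and the exact character rules are assumptions for illustration; the real extraction logic may differ.

```javascript
// Hypothetical word extractor. Assumed rules: '/' and '-' act like spaces,
// double quotes and parentheses are dropped, and single quotes are stripped
// only from the edges of a word so contractions like "isn't" survive.
// Other punctuation (commas, periods) is deliberately left alone here.
function extractWords(text) {
  return text
    .toLowerCase()
    .replace(/[\/\-()"“”]/g, ' ')
    .split(/\s+/)
    .map(word => word.replace(/^['‘’]+|['‘’]+$/g, ''))
    .filter(word => word.length > 0);
}

console.log(extractWords('The "either/or" choice (mostly) isn\'t well-defined'));
// [ 'the', 'either', 'or', 'choice', 'mostly', "isn't", 'well', 'defined' ]
```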

jkomoros commented 3 years ago

Semantic fingerprints should also weight titles and inbound reference text more strongly than normal content text
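A minimal sketch of field weighting, assuming each card exposes `title`, `inboundReferences`, and `body` as plain strings. Those field names and the weight values are made up for illustration.

```javascript
// Hypothetical weights: a word in the title counts three times as much as
// the same word in the body. The numbers are illustrative only.
const FIELD_WEIGHTS = { title: 3, inboundReferences: 2, body: 1 };

// Adds each word's field weight to a running total per word.
function weightedCounts(card) {
  const counts = new Map();
  for (const [field, weight] of Object.entries(FIELD_WEIGHTS)) {
    const words = (card[field] || '').toLowerCase().split(/\s+/).filter(Boolean);
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + weight);
    }
  }
  return counts;
}

const card = {
  title: 'Adjacent possible',
  inboundReferences: 'the adjacent possible',
  body: 'what is possible next',
};
const weighted = weightedCounts(card);
console.log(weighted.get('possible')); // 6 (3 from title + 2 from inbound + 1 from body)
console.log(weighted.get('next'));     // 1
```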

jkomoros commented 3 years ago

There are some distinctive but not content-ful words that stand out in fingerprints but aren't important. (Although as the corpus of cards gets larger and larger, the distinctive words will pop out less and less)

jkomoros commented 3 years ago

In general, the text processing is a bit random, ad hoc, and hard to work on without breaking something. Everything that has to do with natural language processing, stemming, etc. should be factored out into nlp.js, and it should be tested.

We do some weird joining, e.g. of multiple inbound links together. Ideally, within a given property, there would be an array of strings where each string was matched separately, so that words at the end of one sub-string didn't match with the beginning of the next sub-string. This would also ideally be used for sentence breaks and block-level breaks (e.g. paragraph, ul, li). And then things inside of quotes or parentheses could also be a separate sub-string.

There's specialized machinery in query matching now that rewards string matches that have a dash in them if the query also did. But having the right boundaries between sub-strings, combined with processing and storing bigrams, should be sufficient.
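To make the boundary point concrete, here is a sketch where bigrams are computed inside each sub-string and never across two of them. `bigramsForRuns` is a hypothetical name, and the input is assumed to already be an array of sub-strings.

```javascript
// Hypothetical helper: computes bigrams within each sub-string ("run"), so
// the last word of one run never pairs with the first word of the next.
function bigramsForRuns(runs) {
  const result = [];
  for (const run of runs) {
    const words = run.toLowerCase().split(/\s+/).filter(Boolean);
    for (let i = 0; i + 1 < words.length; i++) {
      result.push(words[i] + ' ' + words[i + 1]);
    }
  }
  return result;
}

// Two inbound link texts naively joined into one string...
const joined = bigramsForRuns(['see also emergence complex systems']);
// ...versus kept as separate sub-strings.
const separate = bigramsForRuns(['see also emergence', 'complex systems']);

console.log(joined.includes('emergence complex'));   // true (spurious bigram)
console.log(separate.includes('emergence complex')); // false
```

With boundaries respected like this, a dashed term such as "well-defined" that has been split into two words would still be stored as a bigram, which is the property the dash-specific query machinery currently approximates.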

jkomoros commented 3 years ago

element.innerText already puts in \n for block level element breaks
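Given that, splitting extracted text into sub-strings could start from the newlines and then break on sentence-ending punctuation. The sketch below takes a plain string so it runs outside the browser; in the app the input would presumably come from `element.innerText`. `splitIntoRuns` is a hypothetical name, and the sentence rule is naive (it would split wrongly on abbreviations like "e.g.").

```javascript
// Hypothetical splitter: breaks first on newlines (block-level breaks), then
// on '.', '!' or '?' (a naive sentence boundary), and drops empty pieces.
function splitIntoRuns(text) {
  return text
    .split('\n')
    .flatMap(line => line.match(/[^.!?]+[.!?]*/g) || [])
    .map(run => run.trim())
    .filter(run => run.length > 0);
}

console.log(splitIntoRuns('First block. Still first block\n\nSecond block'));
// [ 'First block.', 'Still first block', 'Second block' ]
```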

jkomoros commented 3 years ago