hmc-whisk / jsLDA

A React based version of jsLDA with brand new features added on

Punctuation in tokens? #120

Closed theobayard closed 4 years ago

theobayard commented 4 years ago

Tokens have punctuation in them. Is that supposed to happen? It makes file downloads weird. It only seems to happen in the middle of a "token". For example, hero".[a and ente…"(mine are both in the movie plots token list.

theobayard commented 4 years ago

Fix: change this._wordPattern in LDAModel to \p{L}(\p{P}?\p{L})+. Play around with the pattern in a regex tester to verify it first.

theobayard commented 4 years ago

\p{L}(\p{P}?\p{L})+ does not capture single-letter words like "I" and "a". \p{L}(\p{P}?\p{L})* might be better.
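
To make the difference between the two patterns concrete, here is a minimal sketch using JavaScript's built-in RegExp with the u flag. This is an assumption: jsLDA may build this._wordPattern with a different regex library, and the tokenize helper and the sample string are hypothetical, made up for illustration only.

```javascript
// Hypothetical helper, not part of jsLDA: return every match of `pattern` in `text`.
function tokenize(text, pattern) {
  return text.match(pattern) || [];
}

// The two candidate patterns from this thread, written as native RegExp literals.
// The "u" flag is required for \p{L} (letters) and \p{P} (punctuation).
const plusPattern = /\p{L}(\p{P}?\p{L})+/gu; // needs at least two letters
const starPattern = /\p{L}(\p{P}?\p{L})*/gu; // also accepts a single letter

// Made-up sample containing a run of punctuation between letters.
const sample = 'I saw a hero".[a friend\'s dog';

console.log(tokenize(sample, plusPattern));
// [ 'saw', 'hero', "friend's", 'dog' ]
console.log(tokenize(sample, starPattern));
// [ 'I', 'saw', 'a', 'hero', 'a', "friend's", 'dog' ]
```

Both patterns allow at most one punctuation character between two letters, so friend's stays whole while hero".[a is split at the run of punctuation. The only difference is whether single letters such as "I" and "a" survive.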

xandaschofield commented 4 years ago

This is actually by design: for English, one-letter words are just about always stopwords, so it's easier to just remove them. But you're welcome to adjust it to make it more predictable (it may even be worth having some "advanced" flexibility for users?)
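
One way the "advanced" flexibility could look is a minimum-letter setting that generates the pattern. The sketch below is only an illustration under assumptions: makeWordPattern and its minLetters parameter are hypothetical names, not existing jsLDA options, and it uses the native RegExp u flag rather than whatever LDAModel uses today.

```javascript
// Hypothetical factory, not part of jsLDA: build a word pattern that requires
// at least `minLetters` letters, with at most one punctuation mark between letters.
function makeWordPattern(minLetters = 2) {
  if (!Number.isInteger(minLetters) || minLetters < 1) {
    throw new RangeError("minLetters must be a positive integer");
  }
  // One leading letter, then (optional punctuation + letter) repeated
  // at least minLetters - 1 times.
  return new RegExp(`\\p{L}(?:\\p{P}?\\p{L}){${minLetters - 1},}`, "gu");
}

const text = "I am a fan";

console.log(text.match(makeWordPattern(2))); // [ 'am', 'fan' ]
console.log(text.match(makeWordPattern(1))); // [ 'I', 'am', 'a', 'fan' ]
```

With the default of 2, one-letter words are dropped as in the current design. A user who wants them can pass 1.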