Closed theobayard closed 4 years ago
Fix this._wordPattern in LDAModel to \p{L}(\p{P}?\p{L})+
Play around with the pattern in a regex tester to verify it first.
\p{L}(\p{P}?\p{L})+ does not capture single-letter words like "I" and "a".
\p{L}(\p{P}?\p{L})* might be better.
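A quick way to compare the two candidates without a regex tester is a small script like the sketch below. It assumes native JavaScript regexes with the "u" flag (required for \p{...} classes); the thread does not show how LDAModel actually builds or applies this._wordPattern, and the tokenize helper and the sample sentence are made up for illustration.

```javascript
// Candidate patterns from the thread. "u" enables \p{...} classes in native
// JavaScript regexes; "g" is required by matchAll to find every token.
const plusPattern = /\p{L}(\p{P}?\p{L})+/gu;
const starPattern = /\p{L}(\p{P}?\p{L})*/gu;

// Hypothetical helper (not from LDAModel): return every match of a pattern.
function tokenize(text, pattern) {
  return Array.from(text.matchAll(pattern), (m) => m[0]);
}

const sample = "I don't have a well-known plan";

// "+" needs at least two letters, so "I" and "a" are skipped.
console.log(tokenize(sample, plusPattern));
// [ "don't", 'have', 'well-known', 'plan' ]

// "*" also accepts a lone letter.
console.log(tokenize(sample, starPattern));
// [ 'I', "don't", 'have', 'a', 'well-known', 'plan' ]
```

Both variants keep a single punctuation mark between letters (the apostrophe in "don't", the hyphen in "well-known"); they differ only in whether one-letter words survive.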
This is actually by design: for English, one-letter words are almost always stopwords, so it's easier to just remove them. But you're welcome to adjust it to make it more predictable (it may even be worth giving users some "advanced" flexibility?).
Tokens have punctuation in them. Is that supposed to happen? It makes file downloads weird. It only seems to happen in the middle of a "token". For example, hero".[a and ente…"(mine are both in the movie plots token list.
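For reference, here is a minimal sketch of how the pattern \p{L}(\p{P}?\p{L})+ discussed in this thread would treat a token like hero".[a. It assumes native JavaScript regexes with the "u" flag; the extractTokens helper, the hasPunctuationRun check, and the sample sentence are hypothetical and not taken from LDAModel.

```javascript
// Pattern discussed in the thread; "u" enables \p{...}, "g" finds every match.
const wordPattern = /\p{L}(\p{P}?\p{L})+/gu;

// Hypothetical helper, not the LDAModel code.
function extractTokens(text) {
  // String.prototype.match returns null when nothing matches.
  return text.match(wordPattern) || [];
}

// Made-up sentence built around the reported token hero".[a
const tokens = extractTokens('the hero".[a said don\'t');
console.log(tokens); // [ 'the', 'hero', 'said', "don't" ]

// True if a token contains two or more punctuation characters in a row.
const hasPunctuationRun = (token) => /\p{P}{2,}/u.test(token);
console.log(tokens.some(hasPunctuationRun)); // false
```

Under these assumptions the run of punctuation in hero".[a ends the token at "hero", while a single mark between letters (as in "don't") is still kept, so tokens are not guaranteed to be punctuation-free.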