Stripword in TopicDoc doesn't allow for mid-word punctuation or non-English characters and will have no idea if LDAModel's RegEx changes

hmc-whisk / jsLDA

A React based version of jsLDA with brand new features added on


Issue #164

Closed: theobayard closed this issue 3 years ago

theobayard commented 3 years ago

TopicDoc.stripword works fine for the example datasets we have, but it will make saliency highlighting and document sorting useless for any corpus whose words contain characters outside A-Za-z, such as mid-word punctuation or non-English letters. The whole function should probably be moved into LDAModel and made to use, or build on, the same RegEx the model uses to parse the documents initially. That way our definition of a word stays consistent.
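A rough sketch of what I mean, not the actual code: one pattern source shared by tokenizing and stripping, so the two can't drift apart. The names `WORD_PATTERN_SOURCE`, `tokenize`, and `stripWord` are hypothetical, the pattern itself is only a stand-in for whatever LDAModel really uses, and the lowercasing is an assumption too.

```typescript
// Hypothetical stand-in for the model's word pattern; the real LDAModel
// RegEx may differ. A word here is a letter, then any run of letters or
// punctuation, then a closing letter (so at least two letters).
const WORD_PATTERN_SOURCE = "\\p{L}[\\p{L}\\p{P}]*\\p{L}";

// Tokenize a document: every match of the shared pattern is one word.
// Lowercasing is an assumption about how the model normalizes words.
function tokenize(text: string): string[] {
  const pattern = new RegExp(WORD_PATTERN_SOURCE, "gu");
  return (text.match(pattern) || []).map((w) => w.toLowerCase());
}

// Strip a displayed token down to the word the model would have seen,
// using the same pattern instead of a hard-coded A-Za-z filter.
// Returns "" when the token contains no word at all.
function stripWord(token: string): string {
  const match = new RegExp(WORD_PATTERN_SOURCE, "u").exec(token);
  return match ? match[0].toLowerCase() : "";
}
```

With something like this, `stripWord("don't,")` and `tokenize("don't,")[0]` agree by construction, and changing the pattern in one place changes both.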

theobayard commented 3 years ago

This same function is in DocView as well, and it shouldn't live in two places. I think it would fit best in LDAModel. It goes hand in hand with getWordTopicValue, which also shows up in both classes, so it's best to move both of them to LDAModel at the same time.