Open voz opened 11 years ago
Agreed! Some words become bullshit only in combination, but there are others that definitely should be stemmed. Thanks for the idea!
Could add a point value to words, or just put them in groups with the same bullshit level, and modify the bs value based on proximity to other bullshit words. E.g. with a threshold of 1, 'monetize' might have 1.2 and always be bullshit, while 'functionality' has 0.8 and so isn't bullshit on its own. But if it's 3 words away from 'empowerment' (also 0.8), it becomes bullshit: 0.8 + (0.8/3) ≈ 1.07.
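A rough sketch of how that proximity scoring might look, just to make the arithmetic concrete. The weights and threshold are the ones floated in this comment. The `WINDOW` cutoff, the function name `scoreWords`, and the choice to sum the contributions of all nearby weighted words are assumptions for illustration, not anything the project implements.

```javascript
// Hypothetical weights, taken from the numbers in the comment above (not tuned).
const WEIGHTS = new Map([
  ['monetize', 1.2],
  ['functionality', 0.8],
  ['empowerment', 0.8],
]);
const THRESHOLD = 1; // a word is flagged when its score exceeds this
const WINDOW = 5;    // assumed: neighbours further away than this are ignored

function scoreWords(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.map((word, i) => {
    const base = WEIGHTS.get(word) || 0;
    // Words without a weight are never bullshit in this sketch.
    if (base === 0) return { word, score: 0, flagged: false };

    let score = base;
    const start = Math.max(0, i - WINDOW);
    const end = Math.min(words.length - 1, i + WINDOW);
    for (let j = start; j <= end; j++) {
      if (j === i) continue;
      // Each weighted neighbour adds its own weight divided by its distance.
      score += (WEIGHTS.get(words[j]) || 0) / Math.abs(i - j);
    }
    return { word, score, flagged: score > THRESHOLD };
  });
}

// 'functionality' and 'empowerment' are 3 words apart here,
// so each scores 0.8 + 0.8 / 3 and gets flagged.
console.log(scoreWords('functionality drives user empowerment'));
```

On its own, 'functionality' stays at 0.8 and is not flagged, while 'monetize' is flagged regardless of context.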
Lol, that's an awesome idea. :) May be hard to implement though, and tough to assign/maintain the values. Should be discussed in a separate issue I think, it's quite different from the stemming proposal.
Yes, but the usual trick here is to come up with the right weights. How do we know that 'monetize' should have 1.2 and not 1.875?
On Jan 9, 2013, at 4:47 PM, Calvin Metcalf notifications@github.com wrote:
Could add a point value to words, or just put them in groups with the same bullshit level, and modify the bs value based on proximity to other bullshit words. E.g. with a threshold of 1, 'monetize' might have 1.2 and always be bullshit, while 'functionality' has 0.8 and so isn't bullshit on its own. But if it's 3 words away from 'empowerment' (also 0.8), it becomes bullshit: 0.8 + (0.8/3) ≈ 1.07.
my bad, was thinking of solutions to the issue of words not bullshit by themselves
The idea of weights is a good one. The only catch is that we'd need a set of manually classified bullshit texts in order to derive the values. But we can discuss it in another issue, as @mourner mentioned.
On Jan 9, 2013, at 4:54 PM, Calvin Metcalf notifications@github.com wrote:
my bad, was thinking of solutions to the issue of words not bullshit by themselves
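For what it's worth, here is one way weights could be derived once such a hand-labeled set exists: a smoothed log-ratio of how often a word shows up in bullshit texts versus clean ones. This is only a sketch under assumptions. The function names, the add-one smoothing, and the log-ratio itself are illustrative choices, and the resulting values live on a different scale than the 1.2 / 0.8 numbers discussed earlier (here positive means "leans bullshit", negative means "leans clean").

```javascript
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Count word occurrences across a list of texts.
function countWords(texts) {
  const counts = new Map();
  let total = 0;
  for (const text of texts) {
    for (const word of tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      total++;
    }
  }
  return { counts, total };
}

// Hypothetical helper: weight = log(P(word | bullshit) / P(word | clean)),
// with add-one smoothing so unseen words don't produce log(0).
function learnWeights(bullshitTexts, cleanTexts) {
  const bs = countWords(bullshitTexts);
  const clean = countWords(cleanTexts);
  const vocab = new Set([...bs.counts.keys(), ...clean.counts.keys()]);
  const weights = new Map();
  for (const word of vocab) {
    const pBs = ((bs.counts.get(word) || 0) + 1) / (bs.total + vocab.size);
    const pClean = ((clean.counts.get(word) || 0) + 1) / (clean.total + vocab.size);
    weights.set(word, Math.log(pBs / pClean));
  }
  return weights;
}

// Toy, made-up training data purely to show the call shape.
const weights = learnWeights(
  ['monetize the synergy', 'monetize the platform'],
  ['fix the bug', 'read the docs']
);
console.log(weights.get('monetize') > 0); // leans bullshit
console.log(weights.get('bug') < 0);      // leans clean
```

With a real labeled corpus the threshold would have to be picked on the same scale, which is part of why this probably belongs in its own issue.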
I experimented with some of the available stemming libraries. Neither the Porter stemmer nor Snowball.js is really at a usable level here.
Reduce derived words to their stems (stemming) and afterwards match on the stems only. It might be more computationally intensive, but the list should become easier to maintain and more bullshit could be discovered.
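To illustrate the idea, here is a minimal sketch of matching on stems. Note that `stem` below is a deliberately naive suffix stripper written for this example, not the Porter or Snowball algorithm, and the suffix list, the word list and the name `findBullshit` are all assumptions. The point is only that one list entry such as 'monetize' ends up covering 'monetizing', 'monetized', 'monetization' and so on.

```javascript
// Naive suffix list, longest first. NOT Porter/Snowball, just for illustration.
const SUFFIXES = [
  'izations', 'ization', 'izing', 'ized', 'izes', 'ize',
  'ing', 'ed', 'es', 'e', 's',
];

function stem(word) {
  const w = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least 3 characters so short words like 'the' are left alone.
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      return w.slice(0, -suffix.length);
    }
  }
  return w;
}

// The maintained list holds base words only; they are stemmed once up front.
const BULLSHIT_STEMS = new Set(['monetize', 'leverage'].map(stem));

// Return every word in the text whose stem is on the list.
function findBullshit(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter((word) => BULLSHIT_STEMS.has(stem(word)));
}

console.log(findBullshit('We are monetizing and leveraged the users'));
// → [ 'monetizing', 'leveraged' ]
```

A stripper this crude will over- and under-stem plenty of real words, so a proper stemmer would still be needed for anything beyond a demo.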