
mourner / bullshit.js

A bookmarklet for translating marketing speak into human-readable text. :poop:

https://mourner.github.io/bullshit.js/

MIT License

1.86k stars, 164 forks

Idea: Add stemming #1

Open voz opened 11 years ago

voz commented 11 years ago

Reduce derived words to their stems (stemming) and then match on the stems only. It might be more computationally intensive, but the list should become easier to maintain, and more bullshit could be discovered.
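A minimal sketch of what this could look like, assuming a deliberately naive suffix stripper rather than a real stemming algorithm such as Porter's. The names `naiveStem`, `BULLSHIT_STEMS`, and `findBullshit`, as well as the suffix list and the two sample terms, are made up for illustration and are not part of bullshit.js.

```javascript
// Hypothetical, deliberately naive suffix stripper; NOT the Porter algorithm.
// Longer suffixes come first so that 'ization' wins over 's' or 'e'.
const SUFFIXES = ['ization', 'ations', 'ation', 'izing', 'ized', 'izes',
                  'ize', 'ing', 'ed', 'es', 's', 'e'];

function naiveStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least three characters of stem so short words survive intact.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// Hypothetical term list, stored as stems so one entry covers many derived forms.
const BULLSHIT_STEMS = new Set(['monetize', 'leverage'].map(naiveStem));

// Return every word in the text whose stem is in the stem list.
function findBullshit(text) {
  const words = text.match(/[A-Za-z]+/g) || [];
  return words.filter((w) => BULLSHIT_STEMS.has(naiveStem(w)));
}
```

With this sketch, a single list entry such as 'monetize' would also catch 'monetizing' and 'monetization'. A crude stripper like this will over- and under-stem plenty of real words, so it only illustrates the matching step, not a usable stemmer.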

mourner commented 11 years ago

Agreed! Some words become bullshit only in combination, but there are others that definitely should be stemmed. Thanks for the idea!

calvinmetcalf commented 11 years ago

Could add a point value to words, or just put them in groups with the same bullshit level, and modify the bs value based on proximity to other bullshit words. E.g. with a threshold of 1, 'monetize' might have 1.2 and always be bullshit, while 'functionality' at 0.8 would not be bullshit on its own; but if it is 3 words away from 'empowerment' (also 0.8), it becomes bullshit: 0.8 + (0.8/3) = 1.07.
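A rough sketch of that scoring, using the example numbers above. Several details are assumptions, since the comment does not pin them down: boosts from all weighted neighbours are summed, neighbours further than `WINDOW` words away are ignored, and a word is flagged when its score reaches the threshold. `WEIGHTS`, `WINDOW`, and `scoreWords` are hypothetical names.

```javascript
// Hypothetical weights, taken from the example in the comment above.
const WEIGHTS = new Map([
  ['monetize', 1.2],
  ['functionality', 0.8],
  ['empowerment', 0.8],
]);
const THRESHOLD = 1;
const WINDOW = 5; // assumption: ignore neighbours more than 5 words away

function scoreWords(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const results = [];
  words.forEach((word, i) => {
    const base = WEIGHTS.get(word);
    if (base === undefined) return; // unweighted words are never scored
    let score = base;
    words.forEach((other, j) => {
      const weight = WEIGHTS.get(other);
      const distance = Math.abs(i - j);
      // Each weighted neighbour adds its weight divided by its distance in words.
      if (j !== i && weight !== undefined && distance <= WINDOW) {
        score += weight / distance;
      }
    });
    results.push({ word, score, flagged: score >= THRESHOLD });
  });
  return results;
}
```

For 'functionality that drives empowerment', both weighted words sit three positions apart, so each scores 0.8 + 0.8/3 and crosses the threshold, while 'functionality' alone stays at 0.8 and is not flagged.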

mourner commented 11 years ago

Lol, that's an awesome idea. :) It may be hard to implement though, and tough to assign and maintain the values. It should be discussed in a separate issue I think, as it's quite different from the stemming proposal.

voz commented 11 years ago

Yes, but the usual trick here is to come up with the right weights. How do we know that "'monetize' might have 1.2" and not 1.875?

(Replied by email on Jan 9, 2013, at 4:47 PM.)

calvinmetcalf commented 11 years ago

My bad, I was thinking of solutions to the problem of words that aren't bullshit by themselves.

voz commented 11 years ago

The idea of weights is a good one. The only catch is that one needs a set of manually classified bullshit texts in order to get the values. But we can discuss it in another issue, as @mourner mentioned.

(Replied by email on Jan 9, 2013, at 4:54 PM.)

calvinmetcalf commented 11 years ago

I experimented with some of the available stemming libraries; neither the Porter stemmer nor Snowball.js is really at a level that is usable here.