learntextvis / textkit

Command line tool for manipulating and analyzing text
MIT License

basic stemming feature #38

Closed justalfred closed 8 years ago

justalfred commented 8 years ago

@vlandham Addresses #1. The following commands work:

textkit text2words test_data/alice.txt | textkit stem --algorithm=porter | less
textkit text2words test_data/alice.txt | textkit stem --algorithm=lancaster | less
textkit text2words test_data/alice.txt | textkit stem --algorithm=SnOwBaLL | less

The algorithm string gets lowercased before comparing, so the name is case-insensitive. An invalid algorithm name passes each token through unchanged. I feel like a warning should be raised, but I'm not sure how to do that.
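One way the behavior described above, plus the missing warning, might be sketched is shown here. This is not the code from this PR: the function names `nltk_stemmers` and `stem_tokens` and the shape of the lookup table are assumptions made for illustration. The NLTK classes `PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer` do exist in `nltk.stem`. The warning goes to stderr so it does not end up in the token stream that is piped to the next command.

```python
import sys


def nltk_stemmers():
    """Build a name -> stem-function table from NLTK (imported lazily).

    Hypothetical helper; requires the nltk package to be installed.
    """
    from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

    return {
        "porter": PorterStemmer().stem,
        "lancaster": LancasterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
    }


def _identity(token):
    """Fallback used when the algorithm name is not recognized."""
    return token


def stem_tokens(tokens, algorithm, stemmers):
    """Yield one stemmed token per input token.

    The algorithm name is lowercased before lookup. If it is unknown, a
    single warning is written to stderr (when iteration starts) and every
    token is passed through unchanged.
    """
    stem = stemmers.get(algorithm.lower())
    if stem is None:
        print(
            f"warning: unknown algorithm {algorithm!r}; "
            "passing tokens through unchanged",
            file=sys.stderr,
        )
        stem = _identity
    for token in tokens:
        yield stem(token)
```

Assuming NLTK is installed, `list(stem_tokens(words, "SnOwBaLL", nltk_stemmers()))` would select the Snowball stemmer regardless of case. If the CLI is built on click, `click.echo(message, err=True)` is another way to write the warning to stderr.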

I started with the English-language stemmers in NLTK. I left one out because the method to call has a different name. That's simple to handle, so I'll add it in another commit.

I'm happy to apply any style changes that you recommend. Or any changes, really.

justalfred commented 8 years ago

Thank you for the review! I promise I'll push a commit this weekend, but until then, I'm swamped.

vlandham commented 8 years ago

This is great stuff! Thanks again, @justalfred! This looks super great. Let me check on the one open question from @iros and get back to you.

vlandham commented 8 years ago

I'm going to merge this PR, and we can tweak little details if they come up.

I'll also update the contributors list to add @justalfred and @jennschiffer. Thanks again!

justalfred commented 8 years ago

Thanks, @vlandham! Happy to contribute!