learntextvis / textkit

Command line tool for manipulating and analyzing text
MIT License

Commands that work? Can we edit this and add a list #31

Closed: arnicas closed this issue 8 years ago

arnicas commented 8 years ago

Thank you for so much work on this, and I'm sorry I've been out of touch. Can we keep a list in here of commands that work? Or a temporary .md file somewhere that is regularly updated until the docs are written? I am trying to reconstruct the commands from the code and issues, and failing!

For instance, for bigram counting - what is the sequence of pipes? I guess it's not

textkit words2bigrams test_data/pride_and_prejudice.txt | textkit topbigrams

?

vlandham commented 8 years ago

Sorry, the topbigrams command is a little strange: it currently takes tokens. So something like this should work:

textkit text2words test_data/alice.txt | textkit filterpunc | textkit topbigrams

And words2bigrams takes word tokens (hence the name starting with words), so you would want to do something like:

textkit text2words test_data/alice.txt | textkit words2bigrams
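
To make the "takes word tokens" point concrete, here is a minimal Python sketch of what turning a token stream into bigrams involves. It is an illustration only, not textkit's actual implementation; the function name `words_to_bigrams` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def words_to_bigrams(tokens: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield each adjacent pair of tokens, in order (hypothetical helper)."""
    previous = None
    for token in tokens:
        if previous is not None:
            yield (previous, token)
        previous = token


# The input is already tokenized, which is why raw text has to go
# through a tokenizing step before bigrams can be formed.
tokens = ["Alice", "was", "beginning", "to", "get"]
print(list(words_to_bigrams(tokens)))
# [('Alice', 'was'), ('was', 'beginning'), ('beginning', 'to'), ('to', 'get')]
```

Five tokens give four bigrams, and a single token gives none, so the step only makes sense on a token stream rather than on a raw file.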

As far as I know, all the commands work. We just need to work on documentation, and perhaps do some renaming. For example, topbigrams could become tokens2topbigrams or something?

vlandham commented 8 years ago

The naming scheme so far is as follows:

Names that don't fit this naming system:

The transform functions:

These could be:

or

We have two utility functions right now:

arnicas commented 8 years ago

Some recipes:

[Note: general issue that there are lots of stray single quotes/apostrophes in the data for a variety of reasons. Maybe we should have a simple way to clean those up.]
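
One possible shape for that cleanup, sketched in Python. This is only an assumption about what "clean those up" could mean (strip quotes from the edges of each token, drop tokens that were nothing but quotes); `strip_stray_quotes` is a hypothetical name, not an existing textkit command.

```python
import re

# Straight and curly single quotes at the start or end of a token.
_EDGE_QUOTES = re.compile(r"^['\u2018\u2019]+|['\u2018\u2019]+$")


def strip_stray_quotes(tokens):
    """Remove leading/trailing single quotes; keep apostrophes inside words."""
    cleaned = []
    for token in tokens:
        token = _EDGE_QUOTES.sub("", token)
        if token:  # drop tokens that consisted only of quote characters
            cleaned.append(token)
    return cleaned


print(strip_stray_quotes(["'Alice", "don't", "'", "said'"]))
# ['Alice', "don't", 'said']
```

A rule this simple also strips the apostrophe from plural possessives such as "dogs'", so it would need refining before being relied on.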

Tokenizing

Stopwords

Word counting

Bigrams

[I'm a little weirded out not getting counts out of these topbigrams results, or some other measure.]
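
For comparison, this is roughly what I would expect a counted version to look like, sketched in Python with `collections.Counter`. It is not how topbigrams works internally; the function name and the `n` parameter are made up for illustration.

```python
from collections import Counter


def top_bigrams_with_counts(tokens, n=10):
    """Return the n most frequent adjacent token pairs with their counts."""
    bigrams = zip(tokens, tokens[1:])  # tokens must be a list (it is sliced)
    return Counter(bigrams).most_common(n)


tokens = "the queen said to the queen of hearts".split()
for (first, second), count in top_bigrams_with_counts(tokens, n=3):
    # One "bigram,count" row per line; the first row here is "the queen,2".
    print(f"{first} {second},{count}")
```

Raw frequency is the simplest measure to report; anything else (a collocation score, say) would be a separate design choice.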

POS

Note: this is a weird result -- did I do this wrong?

"Alice,NNP",390
"Queen,NNP",71
"King,NNP",60
"Turtle,NNP",58
"Mock,NNP",56
"Gryphon,NNP",54
"*,NNP",54

Tips:

Add to your custom stop list from the command line:

echo "--" >> custom/mystops.txt
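
As a rough picture of why appending a line is enough, here is a Python sketch of stop list filtering, assuming the file holds one entry per line and that matching is case-insensitive. Both assumptions, and the function names, are mine and not confirmed textkit behavior.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a stop list with one entry per line; blank lines are ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def filter_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercased form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stopwords]
```

With "--" appended to custom/mystops.txt, `filter_stopwords(tokens, load_stopwords("custom/mystops.txt"))` would drop every "--" token along with the other entries.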

vlandham commented 8 years ago

I think this is a great list.

I'm going to move this to its own issue: #34

I will also add the 'filter small words' as a potential command.
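
A sketch of what such a filter might do, in Python. The command does not exist yet, so the name `filter_small_words` and the `min_length` default are placeholders.

```python
def filter_small_words(tokens, min_length=3):
    """Keep tokens that are at least min_length characters long."""
    return [token for token in tokens if len(token) >= min_length]


print(filter_small_words(["a", "to", "the", "Alice"]))
# ['the', 'Alice']
```

Making the threshold a parameter leaves the question of what counts as "small" to whoever runs it.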