Commands that work? Can we edit this and add a list

arnicas commented 8 years ago

Thank you for so much work on this and I'm sorry I've been out of touch... can we keep a list in here of commands that work? or a .md somewhere temporary that is regularly updated until doc is written? I am trying to recreate from code and issues and failing!

For instance, for bigram counting - what is the sequence of pipes? I guess it's not

textkit words2bigrams test_data/pride_and_prejudice.txt | textkit topbigrams

?

vlandham commented 8 years ago

Sorry, the topbigrams command is a little strange: it currently takes tokens. So something like this should work:

textkit text2words test_data/alice.txt | textkit filterpunc | textkit topbigrams

And words2bigrams takes word tokens (hence the name starting with words), so you would want to do something like:

textkit text2words test_data/alice.txt | textkit words2bigrams

As far as I know, all the commands work. We just need to work on documentation, and perhaps do some renaming. For example, topbigrams could become tokens2topbigrams or something?

vlandham commented 8 years ago

The naming scheme so far is as follows:

If it is capable of removing tokens, the command should start with filter
If it works on raw text, the command should start with text
If it works on word-based tokens, the command should start with words
If it works on sentence-based tokens, the command should start with sentences (we don't have any like this)
We don't have any consistent naming for commands that transform text or tokens, but we probably should
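To make the scheme concrete, here is a small Python sketch that checks a command name against the prefixes listed above. It is only an illustration: the `PREFIXES` table and the `classify` helper are hypothetical names, not part of textkit.

```python
# Hypothetical prefix table; the categories are taken from the naming scheme above.
PREFIXES = {
    "filter": "can remove tokens",
    "text": "works on raw text",
    "words": "works on word tokens",
    "sentences": "works on sentence tokens",
}


def classify(command):
    """Return the scheme description for a command name, or None if no prefix matches."""
    for prefix, meaning in PREFIXES.items():
        if command.startswith(prefix):
            return meaning
    return None


for name in ["filterpunc", "text2words", "words2bigrams", "topbigrams"]:
    print(name, "->", classify(name))
```

Names that return None are the ones that would need renaming under this scheme.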

Names that don't fit this naming system:

topbigrams -- should be tokens2topbigrams

The transform functions:

lowercase
nonewlines
uppercase

These could be:

transformcase
transformnewlines

or

tokens2downcase
tokens2upcase
text2nonewlines

We have two utility functions right now:

download
showstops

arnicas commented 8 years ago

Some recipes:

[Note: general issue that there are lots of stray single quotes/apostrophes in the data for a variety of reasons. Maybe we should have a simple way to clean those up.]

Tokenizing

Split into words and punctuation, then remove punctuation: textkit text2words test_data/alice.txt | textkit filterpunc | more
Split and lowercase: textkit text2words test_data/alice.txt | textkit lowercase - | more

Stopwords

Split into words and punctuation, remove punctuation, remove default stopwords: textkit text2words test_data/alice.txt | textkit filterpunc | textkit filterwords
Tokenize, remove punctuation, remove default stops and a custom list: textkit text2words test_data/alice.txt | textkit filterpunc | textkit filterwords --custom custom/stop.txt | more
Same, plus lowercase it all: textkit text2words test_data/alice.txt | textkit filterpunc | textkit filterwords --custom custom/stop.txt | textkit lowercase - | more

Word counting

textkit text2words test_data/alice.txt | textkit filterpunc | textkit filterwords --custom custom/stop.txt | textkit tokens2counts | more
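For readers who want to see what this pipeline does step by step, here is a rough Python sketch of the same idea: drop punctuation-only tokens, drop stopwords, then count. It is not how textkit implements it; `count_words` and the tiny `STOPWORDS` set are made up for illustration, and the case-insensitive stopword match is my assumption.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop list for the example.
STOPWORDS = {"the", "a", "and", "of"}


def count_words(tokens, stopwords=STOPWORDS):
    """Count tokens after dropping punctuation-only tokens and stopwords."""
    kept = [
        t for t in tokens
        # A token made up entirely of punctuation characters is dropped.
        if not all(ch in string.punctuation for ch in t)
        # Stopwords are matched case-insensitively (an assumption of this sketch).
        and t.lower() not in stopwords
    ]
    return Counter(kept)


tokens = ["Alice", "and", "the", "Queen", ",", "Alice", "!"]
print(count_words(tokens).most_common(2))
```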

Bigrams

[I'm a little weirded out not getting counts out of these topbigrams results, or some other measure.]

textkit text2words test_data/alice.txt | textkit lowercase - | textkit filterpunc | textkit filterwords --custom custom/stop.txt | textkit topbigrams | more
Get bigrams without stopwords, lowercase: textkit text2words test_data/alice.txt | textkit filterpunc | textkit lowercase - | textkit filterwords --custom custom/stop.txt | textkit words2bigrams | more
Count bigrams after filtering for stopwords: textkit text2words test_data/alice.txt | textkit filterpunc | textkit lowercase - | textkit filterwords --custom custom/stop.txt | textkit words2bigrams | textkit tokens2counts | more
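One thing to keep in mind with these recipes: if stopwords are removed before the bigrams are formed, words that were never adjacent in the text can end up paired. A minimal Python sketch of that effect, with a hypothetical `bigram_counts` helper and made-up tokens (not textkit's code):

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of consecutive tokens."""
    return Counter(zip(tokens, tokens[1:]))


tokens = ["the", "mock", "turtle", "and", "the", "mock", "turtle"]
stops = {"the", "and"}

# Filter first, then pair: "turtle" and "mock" become neighbours.
filtered = [t for t in tokens if t not in stops]
print(bigram_counts(filtered).most_common(2))

# Pair on the unfiltered tokens: ("turtle", "mock") never occurs.
print(("turtle", "mock") in bigram_counts(tokens))
```

Whether that is what you want depends on the analysis, but it may explain some surprising pairs in the output.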

POS

Get parts-of-speech for the doc (tokenize, then POS): textkit text2words test_data/alice.txt | textkit tokens2pos - [note: requires the trailing -, otherwise it errors]
Get just the NNPs: textkit text2words test_data/alice.txt | textkit tokens2pos - | grep NNP
Count the NNPs: textkit text2words test_data/alice.txt | textkit tokens2pos - | grep NNP | textkit tokens2counts | more

Note: this is a weird result -- did I do this wrong?

"Alice,NNP",390
"Queen,NNP",71
"King,NNP",60
"Turtle,NNP",58
"Mock,NNP",56
"Gryphon,NNP",54
"*,NNP",54
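One possible contributor to odd rows: `grep NNP` is a substring match, so it also keeps lines tagged NNPS and any token that happens to contain "NNP". Below is a Python sketch that matches the tag field exactly instead. It assumes each line looks like `token,TAG`, which I am inferring from the output above; `count_tag` is a hypothetical helper. It does not explain why `*` gets tagged NNP, since that comes from the tagger itself.

```python
from collections import Counter


def count_tag(lines, tag="NNP"):
    """Count tokens whose tag field equals `tag` exactly."""
    counts = Counter()
    for line in lines:
        # Split on the last comma so tokens that contain commas stay intact.
        token, sep, found = line.rpartition(",")
        if sep and found == tag:
            counts[token] += 1
    return counts


lines = ["Alice,NNP", "Queens,NNPS", "*,NNP", "Alice,NNP", "said,VBD"]
print(count_tag(lines))
```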

Tips:

Add to your custom stop list from the command line:

echo "--" >> custom/mystops.txt

vlandham commented 8 years ago

I think this is a great list.

I'm going to move this to its own issue: #34

I will also add 'filter small words' as a potential command.
