Closed arnicas closed 8 years ago
Sorry. The topbigrams command is a little strange: it currently takes tokens as input. So something like this should work:
textkit text2words test_data/alice.txt | textkit filterpunc | textkit topbigrams
and words2bigrams takes word tokens (hence it starting with words), so you would want to do something like:
textkit text2words test_data/alice.txt | textkit words2bigrams
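To make the "takes tokens" point concrete, here is a rough sketch of what a tokens-to-bigrams step does, written with plain awk rather than textkit. The function name tokens_to_bigrams is made up for illustration, and it assumes the input is one token per line, which is an assumption about the token stream and not something confirmed in this thread.

```shell
# Hypothetical stand-in, NOT textkit code: pair each token with the one before it.
# Assumes one token per line on stdin; writes one bigram per line.
tokens_to_bigrams() {
  awk 'NR > 1 { print prev " " $0 } { prev = $0 }'
}

# Four tokens in, three bigrams out.
printf '%s\n' the quick brown fox | tokens_to_bigrams
# the quick
# quick brown
# brown fox
```

Feeding it raw text instead of tokens would pair up whole lines, which is why the tokenizing step has to come first in the pipe.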
As far as I know, all the commands work. We just need to work on documentation, and perhaps do some renaming. For example, this one could be tokens2topbigrams or something?
The naming scheme so far is as follows:
filter
text
words
sentences (we don't have any like this)

Names that don't fit this naming system:
topbigrams -- should be tokens2topbigrams
The transform functions:
lowercase
nonewlines
uppercase
These could be:
transformcase
transformnewlines
or
tokens2downcase
tokens2upcase
text2nonewlines
We have two utility functions right now:
download
showstops
Some recipes:
[Note: general issue that there are lots of stray single quotes/apostrophes in the data for a variety of reasons. Maybe we should have a simple way to clean those up.]
[I'm a little weirded out not getting counts out of these topbigrams results, or some other measure.]
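One possible stopgap for getting counts, sketched with standard Unix tools rather than any textkit command: if the bigrams arrive one per line (an assumption about the output format, not confirmed here), sort and uniq can tally them. The name count_bigrams is hypothetical.

```shell
# Hypothetical helper, NOT a textkit command: tally bigrams, most frequent first.
# Assumes one bigram per line on stdin; prints "count<TAB>bigram".
count_bigrams() {
  sort | uniq -c | sort -rn |
    awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}

printf '%s\n' 'a b' 'b c' 'a b' 'a b' 'b c' 'c d' | count_bigrams
```

Under the same assumption, piping the output of words2bigrams into count_bigrams might give the counted list, though that has not been checked against textkit's actual output.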
Note: this is a weird result -- did I do this wrong?
"Alice,NNP",390
"Queen,NNP",71
"King,NNP",60
"Turtle,NNP",58
"Mock,NNP",56
"Gryphon,NNP",54
"*,NNP",54
Add to your custom stop list from the command line:
echo "--" >> custom/mystops.txt
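As a rough illustration of what a custom stop list does to a token stream, here is a sketch using grep instead of textkit's own filter. The function drop_stops is hypothetical, and it assumes one token per line and one stop entry per line.

```shell
# Hypothetical helper, NOT textkit's filter implementation.
# Drops every token on stdin that exactly matches a line in the stop file ($1).
drop_stops() {
  # -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file
  grep -v -x -F -f "$1"
}

stops="$(mktemp)"
printf '%s\n' the of > "$stops"
echo "--" >> "$stops"            # same append pattern as above

printf '%s\n' the cat -- sat | drop_stops "$stops"
# cat
# sat
rm -f "$stops"
```

Whole-line matching matters here: without -x, an entry like "--" would also drop any token that merely contains a double hyphen.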
I think this is a great list.
I'm going to move this to its own issue: #34
I will also add the 'filter small words' as a potential command.
Thank you for so much work on this, and I'm sorry I've been out of touch... Can we keep a list in here of commands that work? Or a .md somewhere temporary that is regularly updated until the docs are written? I am trying to recreate the pipelines from code and issues and failing!
For instance, for bigram counting - what is the sequence of pipes? I guess it's not
?