Open iros opened 8 years ago
proposal:
write scripts as executable unix-like tools - each with a consistent input/output format - under an 'umbrella' suite, similar to csvkit.
Lynn's joke name would be a good one: nplkit or maybe just textkit
Tools should be pipe-able as much as possible, which means they should attempt to read and write in a consistent format (as close to text as possible).
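To make the pipe-able idea concrete, here is a minimal sketch of what one such tool could look like: a word tokenizer that reads a document from a stream and writes one token per line. The names `tokenize_words`, `main` and `TOKEN_RE`, the regex, and the lowercasing default are all assumptions for illustration, not a settled design.

```python
import re
import sys

# Hypothetical token pattern: runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize_words(text, lowercase=True):
    """Return the word tokens found in text, in document order."""
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)


def main(stdin=None, stdout=None):
    """Read a whole document from stdin and write one token per line."""
    stdin = stdin if stdin is not None else sys.stdin
    stdout = stdout if stdout is not None else sys.stdout
    for token in tokenize_words(stdin.read()):
        stdout.write(token + "\n")

# A real script would call main() under an `if __name__ == "__main__":` guard
# so it can sit in a pipeline such as `cat doc.txt | tokenize-word | ...`.
```

Because the output is plain text with one token per line, the next tool in the pipe only needs to read lines.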
Example commands (we will pick better names - this is just to get a feel for the input and output):
tokenize-word
input: document of text
input: configuration.
output: text - each token on a new line
tokenize-sentence
input: document of text
input: configuration.
output: text - each sentence on a new line
remove-stopwords
input: list of tokens, each token on a new line
input: list of stopwords, each stopword on a new line
input: configuration
output: list of tokens with stopwords removed
tf-idf
input: list of tokens. each token on a new line
input: configuration
output: list of tokens and scores, each token on a new line. scores separated by comma or tab
windowed-count
input: list of tokens. each token on a new line
input: list of search tokens, each token on a new line
input: configuration
output: rows of counts. Each row is a window of 'time' and there is a column for each token in the search tokens. separated by comma or tab.
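A rough sketch of how remove-stopwords and windowed-count could work on token lists, using only the standard library. It assumes a "window of time" means a fixed number of consecutive tokens, which the proposal does not pin down; `remove_stopwords`, `windowed_count`, `format_rows` and the `window_size` parameter are hypothetical names.

```python
from collections import Counter


def remove_stopwords(tokens, stopwords):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    stop = {w.strip().lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stop]


def windowed_count(tokens, search_tokens, window_size):
    """Count each search token in consecutive, non-overlapping windows.

    Assumption: a window is `window_size` consecutive tokens.
    Returns one row per window, one column per search token.
    """
    rows = []
    for start in range(0, len(tokens), window_size):
        counts = Counter(tokens[start:start + window_size])
        rows.append([counts[s] for s in search_tokens])
    return rows


def format_rows(search_tokens, rows, sep="\t"):
    """Render a header line plus one separated line per window."""
    lines = [sep.join(search_tokens)]
    lines += [sep.join(str(c) for c in row) for row in rows]
    return "\n".join(lines)
```

For example, `windowed_count(["a", "b", "a", "c", "b", "b"], ["a", "b"], 3)` gives one row for `a b a` and one for `c b b`. Passing `sep=","` to `format_rows` would cover the comma-separated variant.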
May want an option for the output from the text cleaning (and other relevant non-counting operations) to be a single string of space-separated words, too; for some tools/interfaces, that's how a document looks. Might make this tool more general.
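That option could be a thin conversion layer between the two shapes. A small sketch, with hypothetical function names:

```python
def tokens_to_document(lines):
    """Join one-token-per-line input into a single space-separated string."""
    return " ".join(line.strip() for line in lines if line.strip())


def document_to_tokens(document):
    """Split a space-separated document back into a list of tokens."""
    return document.split()
```

Blank lines are skipped on the way in, so the round trip only preserves the non-empty tokens.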
Added code snippets in Python dir for how I got bigram frequencies out of NLTK and pseudo-codeish explanation of how I merged unigrams with bigrams and their counts for a wordcloud using both.
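For anyone reading along without the repo, here is one plausible way to do the merge, using only `collections.Counter` rather than NLTK. This is not the snippet from the Python dir; the `min_bigram_count` threshold and the space-joined bigram keys are assumptions.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def merged_counts(tokens, min_bigram_count=2):
    """Combine unigram counts with the counts of frequent bigrams.

    Bigrams seen at least `min_bigram_count` times are added under a
    space-joined key, e.g. "new york", alongside the single words.
    """
    merged = Counter(tokens)
    for (first, second), n in bigram_counts(tokens).items():
        if n >= min_bigram_count:
            merged[first + " " + second] = n
    return merged
```

The resulting mapping of phrase to count is the kind of input a wordcloud generator typically takes. Note that the unigram counts are left untouched here, so a word that only ever appears inside a frequent bigram is still counted on its own.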
Added code samples for tfidf on a TextCollection in nltk and get_sentiment_chunk.py for checking against word lists - in python dir.
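As a reference point for the tf-idf tool, a standard-library sketch of the plain definition, term frequency times log(N / document frequency). It is not the NLTK sample mentioned above, and NLTK's own weighting may differ in detail; `tf_idf` and `score_lines` are hypothetical names.

```python
import math


def tf_idf(term, document, collection):
    """Score a term in one tokenized document against a list of documents."""
    if not document:
        return 0.0
    tf = document.count(term) / len(document)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for doc in collection if term in doc)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf


def score_lines(document, collection, sep="\t"):
    """Return 'token<sep>score' lines, one per distinct token, in first-seen order."""
    seen = dict.fromkeys(document)
    return [f"{t}{sep}{tf_idf(t, document, collection):.4f}" for t in seen]
```

The line format from `score_lines` matches the proposed output: one token per line, with the score separated by a tab or comma.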
General:
Tooling: