Analysis Scripts - Githubissues

iros commented 8 years ago

General:

Transformation
- Tf-idf
- Hclust (hierarchical clustering)
- Output from python / R formats to something browser/d3 friendly
Splitting
- Tokenization: word, sentence
- Lemmatization and bigrams
- Part-of-speech tagging
Cleaning
- Stop words

Tooling:

Python - PEP 8 (for analysis scripts); aim for Python 3.
JS - We'll have a JSLint file (for front-end code)

vlandham commented 8 years ago

proposal:

Write scripts as executable Unix-like tools, each with a consistent input/output format, under an 'umbrella' suite similar to csvkit.

Lynn's joke name would be a good one: nplkit or maybe just textkit

Tools should be pipe-able as much as possible, which means they should attempt to read and write in a consistent format (as close to text as possible).

Example commands (we will pick better names - this is just to get a feel for the input and output):

tokenize-word
- input: document of text
- input: configuration
- output: text, each token on a new line
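To get a feel for the one-token-per-line format, here is a minimal sketch of what tokenize-word could look like. The token pattern and the names tokenize_words and run are assumptions for illustration, not an agreed design; a real tool might use NLTK's tokenizers instead of a regex.

```python
import re

# Hypothetical token pattern (an assumption): runs of ASCII letters or digits,
# optionally joined by internal apostrophes, so "Don't" stays one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def tokenize_words(text):
    """Return the word tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


def run(in_stream, out_stream):
    """Read a document line by line and write each token on its own line."""
    for line in in_stream:
        for token in tokenize_words(line):
            out_stream.write(token + "\n")
```

Calling run(sys.stdin, sys.stdout) from a small script would make this pipe-able; any iterable of lines works as in_stream, which keeps the function easy to test.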

tokenize-sentence
- input: document of text
- input: configuration
- output: text, each sentence on a new line

remove-stopwords
- input: list of tokens, each token on a new line
- input: list of stopwords, each stopword on a new line
- input: configuration
- output: list of tokens with stopwords removed
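A rough sketch of remove-stopwords under the same line-oriented format. The function names are hypothetical, and matching case-insensitively is an assumption that would probably belong in the configuration input.

```python
def load_word_list(lines):
    """Build a set of lowercased entries from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Yield each token (one per input line) that is not in the stopword set.

    Comparison is case-insensitive (an assumption); the original casing of
    surviving tokens is preserved.
    """
    for token in tokens:
        token = token.strip()
        if token and token.lower() not in stopwords:
            yield token
```

Because both inputs are plain iterables of lines, the token list can come from standard input and the stopword list from an open file.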

tf-idf
- input: list of tokens, each token on a new line
- input: configuration
- output: list of tokens and scores, each token on a new line, scores separated by comma or tab
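One thing the tf-idf command leaves open is where the other documents come from, since the idf part needs a collection. The sketch below assumes the tool receives several documents, each already tokenized, and uses one common variant (relative term frequency times the natural log of N / document frequency). Both the variant and the function names are assumptions, not a decision.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {token: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            tok: (count / total) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return scores


def format_rows(doc_scores, sep="\t"):
    """Render 'token<sep>score' lines, highest score first, ties broken by token."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{tok}{sep}{score:.4f}" for tok, score in ranked]
```

With this variant, a token that appears in every document scores 0, which is the usual behaviour for unsmoothed idf.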

windowed-count
- input: list of tokens, each token on a new line
- input: list of search tokens, each token on a new line
- input: configuration
- output: rows of counts, separated by comma or tab. Each row is a window of 'time', with a column for each token in the search tokens.
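For windowed-count, a minimal sketch that treats a window of 'time' as a fixed number of consecutive tokens. The window definition, the header row, and the function names are all assumptions; the window size would presumably come from the configuration input.

```python
from collections import Counter


def windowed_counts(tokens, search_tokens, window_size):
    """Return one row of counts per consecutive, non-overlapping window.

    Each row has one column per search token, in the order given.
    The last window may be shorter than window_size.
    """
    rows = []
    for start in range(0, len(tokens), window_size):
        counts = Counter(tokens[start:start + window_size])
        rows.append([counts[term] for term in search_tokens])
    return rows


def format_table(rows, search_tokens, sep="\t"):
    """Render a header line (an assumption) followed by one line per window."""
    header = sep.join(search_tokens)
    body = [sep.join(str(n) for n in row) for row in rows]
    return [header] + body
```

Passing sep="," gives the comma-separated form, which loads directly into most browser-side CSV parsers.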

arnicas commented 8 years ago

May want an option for the output from the text cleaning (and other relevant non-counting operations) to be a single string of space-separated words, too; for some tools/interfaces, that's how a document looks. Might make this tool more general.
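A small sketch of how that option might look: one formatting function that emits either shape. The name format_tokens and the style values "lines" and "string" are made up for illustration.

```python
def format_tokens(tokens, style="lines"):
    """Serialize tokens as one per line or as a single space-separated string.

    The style names are hypothetical; a real tool would likely expose this
    as a command-line flag or a configuration entry.
    """
    if style == "lines":
        return "".join(token + "\n" for token in tokens)
    if style == "string":
        return " ".join(tokens) + "\n"
    raise ValueError(f"unknown style: {style!r}")
```

Keeping the choice at the output step means the cleaning logic itself does not need to know which shape a downstream tool expects.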

arnicas commented 8 years ago

Added code snippets in the Python dir showing how I got bigram frequencies out of NLTK, plus a pseudocode-style explanation of how I merged unigrams with bigrams and their counts for a word cloud using both.
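For readers without the repo at hand, here is a standard-library sketch of the general idea. It is not the NLTK snippet referred to above: it counts adjacent pairs with collections.Counter, and the merge rule (keep bigrams seen at least min_bigram_count times, stored as space-joined strings alongside the unigram counts) is an assumption.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs, keyed by (first, second) tuples."""
    return Counter(zip(tokens, tokens[1:]))


def merge_for_wordcloud(tokens, min_bigram_count=2):
    """Combine unigram counts with frequent bigram counts in one Counter.

    Bigrams are stored as 'first second' strings. Unigram counts are left
    untouched, so words inside a kept bigram are still counted on their own.
    """
    merged = Counter(tokens)
    for (first, second), count in bigram_counts(tokens).items():
        if count >= min_bigram_count:
            merged[f"{first} {second}"] = count
    return merged
```

Whether to subtract a kept bigram's count from its component words is a design choice this sketch leaves open.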

arnicas commented 8 years ago

Added code samples for tf-idf on a TextCollection in NLTK, and get_sentiment_chunk.py for checking against word lists, in the Python dir.
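As a rough idea of what checking chunks against word lists can look like, here is a standard-library sketch. It is not the contents of get_sentiment_chunk.py; the function name, the fixed-size chunking, and the positive/negative split are assumptions.

```python
def chunk_scores(tokens, positive, negative, chunk_size):
    """Return (positive_hits, negative_hits) for each consecutive chunk.

    positive and negative are sets of lowercase words; tokens are
    lowercased before lookup. The last chunk may be shorter.
    """
    scores = []
    for start in range(0, len(tokens), chunk_size):
        chunk = [t.lower() for t in tokens[start:start + chunk_size]]
        pos = sum(1 for t in chunk if t in positive)
        neg = sum(1 for t in chunk if t in negative)
        scores.append((pos, neg))
    return scores
```

Returning the two counts separately, rather than a single net score, leaves the choice of how to combine them to whatever draws the chart.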

learntextvis / code-samples

Analysis Scripts #4