WIP: Start of TF-IDF

Some issues:

how to specify the corpus?

To compute TF-IDF we need each document kept separate so that we can compute document frequencies. Alternatively, we need a compiled representation of the corpus that records the presence or absence of each term in each document.
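To make the requirement concrete, here is a minimal sketch of why the documents have to stay separate. It is not the code in this PR: the names document_frequency and tfidf_weights are made up for illustration, documents are assumed to be already tokenized into lists of strings, and the weighting (raw count times log(N / df)) is just one common TF-IDF variant.

```python
import math
from collections import Counter


def document_frequency(docs):
    """For each term, count how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        # set() so a term repeated inside one document is counted once
        df.update(set(tokens))
    return df


def tfidf_weights(tokens, df, n_docs):
    """Weight each term in one document by count * log(n_docs / df).

    Terms that never occur in the corpus (df == 0) are skipped.
    """
    tf = Counter(tokens)
    return {
        term: count * math.log(n_docs / float(df[term]))
        for term, count in tf.items()
        if df[term] > 0
    }


# Hypothetical two-document corpus, already tokenized.
docs = [["a", "b", "a"], ["b", "c"]]
df = document_frequency(docs)
weights = tfidf_weights(docs[0], df, len(docs))
```

A term that appears in every document (here "b") gets weight zero, which is only knowable if the per-document boundaries are preserved.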

Right now, this tfidf function just takes one or more paths to indicate the documents in the corpus.

This is less than optimal because:

The user has no control over the tokenization and transformation of corpus documents (lowercasing? stopword removal? etc.).

Is there a better way to specify a corpus?

Perhaps another function, preparecorpus or something similar, that takes a set of documents and turns them into a count-based representation?
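One possible shape for such a function, as a rough sketch only. Nothing here exists in textkit: the name prepare_corpus, its parameters, and the return value are all assumptions. The idea is that the caller chooses the tokenizer and transformations, and gets back per-document term counts plus corpus-wide document frequencies.

```python
from collections import Counter


def prepare_corpus(texts, tokenize=str.split, lowercase=True,
                   stopwords=frozenset()):
    """Hypothetical helper: turn raw texts into a count-based corpus.

    Returns (term_counts, doc_freq), where term_counts holds one Counter
    per document and doc_freq maps each term to the number of documents
    that contain it.
    """
    term_counts = []
    doc_freq = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        tokens = [t for t in tokenize(text) if t not in stopwords]
        counts = Counter(tokens)
        term_counts.append(counts)
        # counts.keys() lists each distinct term once per document
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


# Example texts are invented for illustration.
term_counts, doc_freq = prepare_corpus(
    ["The cat sat", "the dog sat down"],
    stopwords={"the"},
)
```

With this split, a tfidf step would only need the prepared counts, and decisions about lowercasing or stopwords stay with the user.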

character encodings

This currently doesn't work on Python 2, most likely because of NLTK issues with character encodings. I'm not sure whether this affects other functions.

acquiring corpora

The NLTK corpora sometimes contain additional content. They also come in different file encodings (not UTF-8, for some reason). Right now textkit isn't great at handling non-UTF-8 input.
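One way to cope with mixed encodings is to try a short list of candidates in order. The sketch below is an assumption about how that could look, not how textkit reads files today; read_text and its default encoding list are made up. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3.

```python
import io


def read_text(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: decode a file with the first encoding that works.

    latin-1 maps every byte to a character, so it never fails; keep it
    last, as a fallback that may decode some characters incorrectly.
    """
    for encoding in encodings:
        try:
            with io.open(path, encoding=encoding) as handle:
                return handle.read()
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %s with any of %r" % (path, encodings))
```

This does not detect the real encoding; it only avoids a crash, so a corpus with a known encoding is better read with that encoding passed explicitly.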

learntextvis / textkit

WIP: Start of TF-IDF #43