learntextvis / textkit

Command line tool for manipulating and analyzing text
MIT License

WIP: Start of TF-IDF #43

Closed by vlandham 5 years ago

vlandham commented 8 years ago

Some issues:

how to specify the corpus?

To do TF-IDF, we need each document kept separate so we can compute document frequency. Alternatively, we need a compiled representation of the corpus indicating the presence or absence of each term in each document.

Right now, this tfidf function just takes one or more paths to indicate the documents in the corpus.

This is less than optimal because:

The user has no control over tokenization and transformation of the corpus documents (lowercasing? stopword removal? etc.).
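To make the document-frequency requirement concrete, here is a minimal sketch of TF-IDF over documents that are already tokenized. It is not the code in this PR: the function name `tf_idf`, the smoothed IDF formula, and the toy corpus are all assumptions for illustration, and it assumes Python 3 division.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus_docs):
    """Score each term of one document against a corpus (hypothetical helper).

    doc_tokens: list of tokens for the document being scored.
    corpus_docs: list of token lists, one list per corpus document.
    """
    n_docs = len(corpus_docs)

    # Document frequency: the number of corpus documents containing each term.
    # This is why documents must stay separate; a single merged token stream
    # would only give total counts, not per-document presence.
    doc_freq = Counter()
    for tokens in corpus_docs:
        doc_freq.update(set(tokens))

    term_counts = Counter(doc_tokens)
    scores = {}
    for term, count in term_counts.items():
        # Smoothed IDF (an assumption) so terms missing from the corpus
        # do not cause a division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
scores = tf_idf(["the", "cat", "cat"], corpus)
```

Because tokenization happens before this function is called, whoever prepares `corpus_docs` decides on lowercasing and stopwords, which is exactly the control the path-based interface hides.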

Is there a better way to specify a corpus?

Perhaps another function, preparecorpus or something, that takes a bunch of docs and turns them into a count-based representation?
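One possible shape for such a function, as a rough sketch only. The name `prepare_corpus`, its `tokenize` and `transform` parameters, and the output layout are hypothetical, not an existing textkit API. It takes document strings rather than paths to keep the example short.

```python
import json
from collections import Counter


def prepare_corpus(texts, tokenize=str.split, transform=str.lower):
    """Hypothetical helper: compile documents into document-frequency counts.

    texts: iterable of document strings.
    tokenize / transform: caller-supplied callables, so the user decides
    how corpus documents are normalized and split.
    """
    doc_freq = Counter()
    n_docs = 0
    for text in texts:
        tokens = tokenize(transform(text))
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
        n_docs += 1
    return {"n_docs": n_docs, "doc_freq": dict(doc_freq)}


compiled = prepare_corpus(["The cat sat", "the dog"])
# The result is plain dicts and ints, so it can be stored as JSON and reused.
serialized = json.dumps(compiled, sort_keys=True)
```

A compiled form like this keeps only what IDF needs (the document count and per-term document frequencies), so the original documents would not have to be re-read on every run.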

character encodings

This currently doesn't work on Python 2, most likely because of NLTK issues with character encodings. I'm not sure if this affects other functions.

acquiring corpora

The NLTK corpora sometimes have additional content in them. They also come in different file encodings (not UTF-8, for some reason). Right now textkit isn't great about handling non-UTF-8 input.
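One way to cope with mixed encodings is to try a short list of candidates in order. The sketch below is an assumption about how that could look, not what textkit does today. `decode_bytes` and `read_text` are hypothetical names, and the candidate list is a guess.

```python
def decode_bytes(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: decode bytes with the first encoding that works."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of {!r} could decode the input".format(encodings))


def read_text(path, encodings=("utf-8", "latin-1")):
    """Read a file as bytes, then decode it with the fallback list."""
    with open(path, "rb") as handle:
        return decode_bytes(handle.read(), encodings)


from_latin1 = decode_bytes(u"caf\u00e9".encode("latin-1"))
from_utf8 = decode_bytes(u"caf\u00e9".encode("utf-8"))
```

Note that latin-1 accepts any byte sequence, so as a last resort it never fails but can silently produce the wrong characters for files in some other encoding.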

iros commented 8 years ago

The directory format makes sense to me, but I'm assuming that's because I don't fully grok the motivation behind the JSON docs (since I wasn't as much a part of developing the visualizations themselves).

It looks good otherwise.