An early version of this library could provide a term-document matrix. As I use EF in my text mining class, it's becoming clear that this wide format needs to exist again: it's a common need, and it isn't always intuitive to build yourself.
[ ] pivot a long document/term/count DataFrame to a wide doc x term matrix.
[ ] support a multi-index (e.g. an index that keeps both book and page). This may mean tinkering with `stack` and `unstack` rather than `pivot`.
[ ] support a supplied column order. This is important if you used a set of documents to train a model (e.g. a classifier) and then want the term count vector for a new document, with columns matching the training matrix.
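
As a starting point for discussion, here is a minimal pandas sketch of the three items above. It is not the library's API: the function name `to_wide` and the column names `book`, `page`, `token`, and `count` are assumptions made for illustration. It uses `groupby` plus `unstack` rather than `pivot` so that a multi-level index works, and `reindex` to force a supplied column order.

```python
import pandas as pd


def to_wide(long_df, index, term_col="token", count_col="count", columns=None):
    """Hypothetical helper: long doc/term/count frame -> wide doc x term matrix."""
    index = [index] if isinstance(index, str) else list(index)
    wide = (
        long_df.groupby(index + [term_col])[count_col]
        .sum()                             # collapse repeated (doc, term) rows
        .unstack(term_col, fill_value=0)   # terms become columns, missing -> 0
    )
    if columns is not None:
        # Align to a supplied vocabulary: unseen terms are dropped,
        # terms absent from these documents become zero columns.
        wide = wide.reindex(columns=list(columns), fill_value=0)
    return wide


# Toy data, invented for this example.
long_df = pd.DataFrame({
    "book": ["a", "a", "a", "b", "b"],
    "page": [1, 1, 2, 1, 1],
    "token": ["cat", "dog", "cat", "dog", "bird"],
    "count": [2, 1, 3, 4, 1],
})

by_page = to_wide(long_df, index=["book", "page"])  # multi-index rows
by_book = to_wide(long_df, index="book")            # one row per book

# A new document, aligned to the columns of the "training" matrix.
new_doc = pd.DataFrame({
    "book": ["c", "c"],
    "page": [1, 1],
    "token": ["cat", "fish"],
    "count": [5, 2],
})
aligned = to_wide(new_doc, index="book", columns=by_book.columns)
```

In this sketch `aligned` has the same columns as `by_book`, so the new document's vector can be passed to a model trained on `by_book`; the unseen term `fish` is silently dropped, which may or may not be the behavior we want by default.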