htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Books with high word/page count not reading tokens #10

Closed organisciak closed 7 years ago

organisciak commented 7 years ago

The Feature Reader made an assumption (gasp!) about the structure of a book that turned out to be incorrect: that the total number of unique tokens would never exceed page_count * 2000. Work on Bookworm turned up books that break this assumption.
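To make the failure mode concrete, here is a minimal sketch of the bound and of a book that exceeds it. This is not the Feature Reader's code: the function names and the toy page data are hypothetical, and counting unique tokens per page and then summing is only one reading of "total number of unique tokens".

```python
# Hypothetical illustration only; not the Feature Reader's actual code.
ASSUMED_MAX_TOKENS_PER_PAGE = 2000  # the per-page bound described in this issue


def assumed_capacity(page_count):
    """Largest token total the old assumption allows for a book."""
    return page_count * ASSUMED_MAX_TOKENS_PER_PAGE


def actual_token_rows(pages):
    """Sum the number of unique tokens found on each page (an assumed reading of 'total')."""
    return sum(len(set(page_tokens)) for page_tokens in pages)


def exceeds_assumption(pages):
    """True when a book holds more unique tokens than the bound allows."""
    return actual_token_rows(pages) > assumed_capacity(len(pages))


# A dense two-page book: 2,500 distinct tokens on each page (made-up data).
dense_pages = [[f"tok{p}_{i}" for i in range(2500)] for p in range(2)]

# A sparse two-page book that sits comfortably inside the bound.
sparse_pages = [["the", "cat", "sat", "the"], ["on", "the", "mat"]]

print(exceeds_assumption(dense_pages))   # True: 5000 > 4000
print(exceeds_assumption(sparse_pages))  # False: 6 <= 4000
```

Anything sized from `assumed_capacity` would be too small for the dense book, which is the kind of case the issue describes. Sizing from the data itself avoids the problem.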

The fix is done (https://github.com/htrc/htrc-feature-reader/commit/7377eee96e4428d765ac42ee1b1f2cb4eb6fe195) and tests have been written, so it just needs to be merged into the main branch and prepared for release on PyPI and conda.