The Feature Reader made an assumption (gasp!) about the structure of a book that turned out to be incorrect: that the total number of unique tokens would not be greater than page_count * 2000. Work on Bookworm found cases where this assumption broke.
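To make the broken assumption concrete, here is a minimal sketch of the bound as described above. The names `ASSUMED_TOKENS_PER_PAGE`, `assumed_token_ceiling`, and `breaks_assumption` are hypothetical and are not part of the Feature Reader API; only the 2000-per-page multiplier comes from this report. How the library actually relied on the bound is not shown here.

```python
# Hypothetical illustration only: these names are not part of the
# htrc-feature-reader API. The 2000 multiplier is the one quoted in this report.
ASSUMED_TOKENS_PER_PAGE = 2000


def assumed_token_ceiling(page_count: int) -> int:
    """Upper bound on unique tokens that the old assumption implied."""
    return page_count * ASSUMED_TOKENS_PER_PAGE


def breaks_assumption(page_count: int, unique_token_count: int) -> bool:
    """Return True when a volume has more unique tokens than the bound allows."""
    return unique_token_count > assumed_token_ceiling(page_count)
```

A check like this could be run over a collection to find the volumes that triggered the problem, for example `breaks_assumption(3, 6001)` for a three-page volume with 6001 unique tokens.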
The fix is done (https://github.com/htrc/htrc-feature-reader/commit/7377eee96e4428d765ac42ee1b1f2cb4eb6fe195) and tests have been written, so it just needs to be merged into the main branch and prepared for release on PyPI and conda.