An option for splitting on spaces only, which will then also count words containing punctuation. This is actually what Tesseract uses, so there is a use case for this as well.
An option to undo the hyphenation at line ends. This also requires deleting the newline characters before counting the frequencies. Moreover, any blank lines should also be deleted.
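To make the two options concrete, here is a rough sketch of how they might behave on plain text. It is not taken from the hocr-wordfreq source: the function names `dehyphenate` and `word_frequencies` and the parameters `spaces_only` and `undo_hyphens` are made up for illustration, and the sketch works on already extracted text rather than on hOCR input.

```python
import re
from collections import Counter


def dehyphenate(text):
    """Join words split by a hyphen at a line end (hypothetical helper)."""
    # Drop blank lines first so they cannot sit between the two halves of a word.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    joined = "\n".join(lines)
    # Assumption: every hyphen directly before a newline is a line-break hyphen.
    joined = re.sub(r"-\n", "", joined)
    # The remaining newlines only separate words, so turn them into spaces.
    return joined.replace("\n", " ")


def word_frequencies(text, spaces_only=False, undo_hyphens=False):
    """Count words; both keyword options are illustrative, not real flags."""
    if undo_hyphens:
        text = dehyphenate(text)
    if spaces_only:
        # Split on whitespace only, so "works." and "works," stay distinct.
        words = text.split()
    else:
        # Otherwise keep only runs of word characters, dropping punctuation.
        words = re.findall(r"\w+", text)
    return Counter(words)


sample = "The recog-\nnition works.\n\nIt works, mostly."
freq_spaces = word_frequencies(sample, spaces_only=True, undo_hyphens=True)
freq_default = word_frequencies(sample, undo_hyphens=True)
```

One caveat with this approach: a genuine compound such as "well-known" that happens to break at the hyphen would be merged into "wellknown", so the option should probably stay off by default.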
We discussed more options for hocr-wordfreq: