ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

More options for hocr-wordfreq #101

Closed zuphilip closed 7 years ago

zuphilip commented 7 years ago

We discussed more options for hocr-wordfreq:

  1. An option for splitting on spaces only, which will then also count words containing punctuation. This is actually what is used for tesseract, so there is a use case for this as well.
  2. An option to undo hyphenation at line ends. This also requires deleting the newline characters before counting the frequencies. Moreover, any blank lines should be deleted as well.
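
As a rough illustration of the two options, here is a minimal sketch that works on plain text rather than on hOCR input. The function name `word_frequencies` and the parameters `spaces_only` and `dehyphenate` are hypothetical and are not existing hocr-wordfreq options; splitting on `\W+` in the default branch is also an assumption, not a statement about what the tool currently does.

```python
import re
from collections import Counter


def word_frequencies(text, spaces_only=False, dehyphenate=False):
    """Count word frequencies in plain text (hypothetical helper)."""
    if dehyphenate:
        # Drop blank lines first, so that a trailing hyphen can be
        # joined with the next non-empty line.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = "\n".join(lines)
        # Undo hyphenation at line ends: "exam-\nple" becomes "example".
        text = re.sub(r"-\n", "", text)
    if spaces_only:
        # Split on whitespace only; punctuation stays attached to the words.
        words = text.split()
    else:
        # Assumed default: split on any run of non-word characters.
        words = [w for w in re.split(r"\W+", text) if w]
    return Counter(words)


sample = "Hello, world! An exam-\n\nple of hyphen-\nation, world."
print(word_frequencies(sample, spaces_only=True, dehyphenate=True))
```

One limitation of this sketch: a genuinely hyphenated compound that happens to break at a line end (for example "well-" followed by "known") would also be joined into a single word, so a real implementation might need a dictionary check or a more careful rule.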