ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Adapt hocr-wordfreq for non-ASCII symbols #99

Closed zuphilip closed 7 years ago

zuphilip commented 7 years ago

This is tested with the ersch (fraktur) example from ocropy.

stweil commented 7 years ago

Interesting solution, looks good. Do we want to restrict words to letters and numbers (excluding other printable characters)?

zuphilip commented 7 years ago

Do we want to restrict words to letters and numbers (excluding other printable characters)?

Well, I want to exclude punctuation and anything that does not look like a "word". Yes, splitting on \W+ (runs of non-word characters) makes sense to me. Do you encounter any problems with the current solution?
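
For illustration, here is a minimal sketch of what splitting on \W+ does with non-ASCII text. It is not the actual hocr-wordfreq code: the function name `word_frequencies` and the sample string are assumptions made up for this example. Note that \w also covers digits and the underscore, so those stay part of a "word".

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count words in text, splitting on runs of non-word characters.

    Hypothetical helper for illustration only, not part of hocr-tools.
    """
    # For str patterns in Python 3, \W is already Unicode-aware;
    # re.UNICODE is passed only to make that explicit.
    tokens = re.split(r"\W+", text, flags=re.UNICODE)
    # re.split yields empty strings at the edges when the text starts
    # or ends with a separator, so drop them before counting.
    return Counter(token for token in tokens if token)


# Made-up sample with umlauts, sharp s and the long s used in Fraktur.
sample = "Über die Größe, die Wiſſenſchaft und die Größe!"
freq = word_frequencies(sample)

for word, count in freq.most_common():
    print(f"{count}\t{word}")
```

With this approach the comma and the exclamation mark act as separators only and never show up as entries, while letters such as "ö", "ß" and "ſ" are kept inside their words.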