ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

encoding errors #153

Open tmbdev opened 5 years ago

tmbdev commented 5 years ago

Many HTML files do not contain proper character set declarations, but we still need to be able to read them. LXML is a bit too picky and fails when such files are opened with:

doc = html.parse(args.file)

See the discussion here for how to fix it:

https://stackoverflow.com/questions/15302125/html-encoding-and-lxml-parsing

I'm going to see whether I can add a fix for this.
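One possible shape for such a fix, as a minimal sketch rather than the change that actually went into hocr-tools: read the file as bytes and try a few encodings strictly before giving up. The function name `guess_encoding` and the candidate list are hypothetical, chosen only for illustration.

```python
# Hypothetical candidate order; not taken from hocr-tools.
CANDIDATES = ("utf-8", "windows-1252")


def guess_encoding(raw: bytes, candidates=CANDIDATES) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding: raises on invalid byte sequences
            return name
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a code point, so it always succeeds.
    return "iso-8859-1"
```

The returned name could then be handed to lxml, for example as `html.parse(args.file, parser=html.HTMLParser(encoding=name))`. Whether every encoding name is accepted there depends on the underlying libxml2 build, so treat that as an assumption to verify.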

zuphilip commented 5 years ago

What do your example files look like? Can you share one here (or at least the beginning of it)?

All examples in the linked discussion seem to have a proper character set declaration, but lxml does not recognize the short form <meta charset='utf-8'>. Do you face the same problem?
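To check which kind of declaration a given file carries, a rough sketch like the following could be run on its first few kilobytes. It uses only the standard library; the helper name `declared_charset` and the `scan_limit` default are hypothetical, and the regular expression is a heuristic, not a full HTML parser.

```python
import re

# Matches both the short form <meta charset='utf-8'> and the long form
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, scan_limit: int = 4096):
    """Return the lowercased charset named in a <meta> tag, or None if absent."""
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A result of `None` would point to a file with no declaration at all, which is the case described at the top of this issue. A non-`None` result means a declaration is present in one of the two forms.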