tmbdev opened this issue 5 years ago (status: Open)
What do your example files look like? Can you share one here (or at least the beginning of it)?
All examples in the linked discussion seem to have a proper character set declaration, but lxml does not recognize the short form <meta charset='utf-8'>. Do you face the same problem?
Many HTML files do not contain proper character set declarations, but we still need to be able to read them. lxml is a bit too picky and fails when such files are opened with:
doc = html.parse(args.file)
See the discussion here for how to fix it:
https://stackoverflow.com/questions/15302125/html-encoding-and-lxml-parsing
I'm going to see whether I can add a fix for this.
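One possible shape for such a fix, as a minimal sketch rather than the actual patch: read the file as bytes, guess the encoding ourselves, and pass it to lxml explicitly instead of relying on its own detection. The helper names `guess_encoding` and `parse_html_file` are hypothetical, and the fallback order (declared charset, then UTF-8, then ISO-8859-1) is an assumption, not something taken from the linked discussion.

```python
import codecs
import re

# Matches both the short form <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def guess_encoding(raw, default="utf-8"):
    """Hypothetical helper: pick an encoding name for raw HTML bytes."""
    match = _META_CHARSET.search(raw[:4096])
    if match:
        name = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(name)  # only trust names Python recognizes
            return name
        except LookupError:
            pass
    try:
        raw.decode(default)
        return default
    except UnicodeDecodeError:
        # Assumed last resort: ISO-8859-1 can decode any byte sequence.
        return "iso-8859-1"


def parse_html_file(path):
    """Hypothetical replacement for html.parse(path)."""
    from lxml import html  # third-party, imported lazily

    with open(path, "rb") as stream:
        raw = stream.read()
    # Override lxml's own detection with the encoding guessed above.
    parser = html.HTMLParser(encoding=guess_encoding(raw))
    return html.document_fromstring(raw, parser=parser)
```

With this in place, the failing call would become `doc = parse_html_file(args.file)`. Note that it returns the root element rather than the element tree that `html.parse` returns, so callers that rely on tree methods would need `doc.getroottree()`.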