thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

newer html2text dropped -utf8 switch #718

Closed mentalstring closed 2 years ago

mentalstring commented 2 years ago

Our system recently upgraded html2text from 1.3.2a to 2.1.1, and urlwatch's 'html2text' method stopped working with this error:

Unrecognized command line option "-utf8", try "-help".

Previous options:

This is html2text, version 1.3.2a

Usage:
  html2text -help
  html2text -version
  html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] \
     [ -rcfile <file> ] [ -style ( compact | pretty ) ] [ -width <w> ] \
     [ -o <file> ] [ -nobs ] [ -ascii ] [ <input-url> ] ...
Formats HTML document(s) read from <input-url> or STDIN and generates ASCII
text.
  -help          Print this text and exit
  -version       Print program version and copyright notice
  -unparse       Generate HTML instead of ASCII output
  -check         Do syntax checking only
  -debug-scanner Report parsed tokens on STDERR (debugging)
  -debug-parser  Report parser activity on STDERR (debugging)
  -rcfile <file> Read <file> instead of "$HOME/.html2textrc"
  -style compact Create a "compact" output format (default)
  -style pretty  Insert some vertical space for nicer output
  -width <w>     Optimize for screen widths other than 79
  -o <file>      Redirect output into <file>
  -nobs          Do not use backspaces for boldface and underlining
  -ascii         Use plain ASCII for output instead of ISO-8859-1
  -utf8          Assume both terminal and input stream are in UTF-8 mode

Newer options:

This is html2text, version 2.1.1

Usage:
  html2text -help
  html2text -version
  html2text [ -check ] [ -debug-scanner ] [ -debug-parser ] \
     [ -rcfile <file> ] [ -width <w> ] [ -nobs ] [ -links ]\
     [ -from_encoding ] [ -to_encoding ] [ -ascii ]\
     [ -o <file> ] [ <input-file> ] ...
Formats HTML document(s) read from <input-file> or STDIN and generates ASCII
text.
  -help          Print this text and exit
  -version       Print program version and copyright notice
  -check         Do syntax checking only
  -debug-scanner Report parsed tokens on STDERR (debugging)
  -debug-parser  Report parser activity on STDERR (debugging)
  -rcfile <file> Read <file> instead of "$HOME/.html2textrc"
  -width <w>     Optimize for screen widths other than 79
  -nobs          Do not render boldface and underlining (using backspaces)
  -links         Generate reference list with link targets
  -from_encoding Treat input encoded as given encoding
  -to_encoding   Output using given encoding
  -ascii         Use plain ASCII for output instead of UTF-8
                 alias for: -to_encoding ASCII//TRANSLIT 
  -o <file>      Redirect output into <file>

It seems html2text now defaults to UTF-8, while previous versions assumed ISO-8859-1 unless -utf8 was passed.

Beyond the version difference, it seems html2text development has changed hands, and the 2.0.0+ versions no longer have the -utf8 switch.

Currently this is relevant here:

https://github.com/thp/urlwatch/blob/1836a41c7d93dcc9129826a8f88f321194c4fa67/lib/urlwatch/html2txt.py#L93

I'm not seeing a clean solution. Just wanted to report this for now, as it will become a bigger issue as more people update html2text. I guess the current alternatives (besides not upgrading html2text) are to use one of the other html2text methods (e.g. pyhtml2text).

thp commented 2 years ago

I filed an upstream bug report so that they may consider re-adding -utf8 as a no-op just for backwards compatibility.

As a workaround for now, urlwatch now checks the html2text -help output for -utf8: if found, it adds the flag to the command-line arguments; if not, it leaves it out (this avoids having to parse version numbers and stuff). This should make it work with both old and new versions, with the downside that html2text is executed twice (once for -help and once for the real conversion).
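
For anyone wanting to do the same kind of probing in their own scripts, here is a minimal sketch of the idea. It is not the actual urlwatch code: the function names (`probe_help`, `help_mentions_utf8`, `build_command`) are made up for illustration, and the base arguments (`-nobs`) are just an example picked from the help output quoted above.

```python
import subprocess


def probe_help(executable="html2text"):
    """Run `<executable> -help` and return whatever it prints.

    stderr is merged into stdout, since it is an assumption (not verified
    here) which stream a given html2text version writes its help text to.
    """
    result = subprocess.run(
        [executable, "-help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def help_mentions_utf8(help_text):
    """Return True if the help text lists -utf8 as a standalone option."""
    # Compare whole whitespace-separated tokens so that a mention of
    # "UTF-8" in a description is not mistaken for the flag.
    return "-utf8" in help_text.split()


def build_command(help_text, executable="html2text"):
    """Build the argument list, adding -utf8 only where it is supported."""
    cmd = [executable, "-nobs"]  # illustrative base arguments
    if help_mentions_utf8(help_text):
        cmd.append("-utf8")
    return cmd
```

Typical use would be `build_command(probe_help())`, followed by feeding the HTML to that command on stdin. Keeping the help-text check separate from the subprocess call makes it easy to test against saved help output from both versions, and the result could be cached to avoid the extra -help run per conversion.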