thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

html2text lynx: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte #426

Open petRUShka opened 5 years ago

petRUShka commented 5 years ago

urlwatch 2.17 fails on a specific page with this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte.

Config:

filter:
  - html2text:
      method: lynx
kind: url
name: "Docs"
url: https://istina.msu.ru/dissertation_councils/by_organization/214524/documents/

Error:

ERROR: Docs (https://istina.msu.ru/dissertation_councils/by_organization/214524/documents/)

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/urlwatch/handler.py", line 91, in process
    data = FilterBase.process(filter_kind, subfilter, self, data)
  File "/usr/lib/python3.7/site-packages/urlwatch/filters.py", line 89, in process
    return filtercls(state.job, state).filter(data, subfilter)
  File "/usr/lib/python3.7/site-packages/urlwatch/filters.py", line 174, in filter
    return html2text(data, method=method, options=options)
  File "/usr/lib/python3.7/site-packages/urlwatch/html2txt.py", line 97, in html2text
    stdout = stdout.decode(stdout_encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte
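
The last frame of the traceback is a strict UTF-8 decode of the converter's output. As a minimal sketch of how that kind of failure arises, the snippet below builds a byte string in which the lead byte 0xd0 is not followed by a continuation byte. The sample bytes and the helper `strict_decode_error` are hypothetical and only for illustration; they are not the actual lynx output and not urlwatch code.

```python
# Hypothetical sample bytes; this is not the real lynx output from the report.
good = "Документы".encode("utf-8")  # every Cyrillic letter here is two bytes

# Remove the second byte of the first letter. The lead byte 0xd0 is now
# followed by another lead byte instead of a continuation byte (0x80-0xbf).
broken = good[:1] + good[2:]


def strict_decode_error(raw):
    """Return the UnicodeDecodeError raised by a strict UTF-8 decode, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = strict_decode_error(broken)
print(err)

# A lossy decode replaces the undecodable byte with U+FFFD instead of raising.
lossy = broken.decode("utf-8", errors="replace")
print(lossy)
```

This only shows what "invalid continuation byte" means at the byte level. It does not establish why the lynx output for this particular page contains such a sequence.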

thp commented 4 years ago

Try a different html2text method. It seems that lynx has a problem displaying some characters on this page (the website encoding seems to be UTF-8, so that part looks fine).

Quoted from lib/urlwatch/html2txt.py:

    Method may be one of:
     'lynx'           - Use "lynx -dump" for conversion
                        options: see "lynx -help" output for options that work with "-dump"
     'html2text'      - Use "html2text -nobs" for conversion
                        options: https://linux.die.net/man/1/html2text
     'bs4'            - Use Beautiful Soup library to prettify the HTML
                        options: "parser" only, bs4 supports "lxml", "html5lib", and "html.parser"
                        https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
     're'             - A simple regex-based HTML tag stripper
     'pyhtml2text'    - Use Python module "html2text"
                        options: https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options

You can use urlwatch --test-filter <NUMBER> (where <NUMBER> is the index from urlwatch --list) to check the output.

It worked for me with at least the re, bs4 and pyhtml2text methods (you might need to install some dependencies).
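
To see which of these methods have their Python dependency installed, a quick check along the following lines may help. This is a sketch under assumptions: the mapping from method name to module name is inferred from the docstring quoted above (Beautiful Soup is imported as `bs4`, the "html2text" module as `html2text`), and `REQUIRED_MODULES` and `available_methods` are hypothetical names, not part of urlwatch.

```python
from importlib.util import find_spec

# Assumed mapping from html2text method name to the Python module it needs.
# "re" is in the standard library, so it should always be available.
REQUIRED_MODULES = {
    "re": "re",
    "bs4": "bs4",
    "pyhtml2text": "html2text",
}


def available_methods():
    """Map each method name to True if its module can be found, else False."""
    return {
        method: find_spec(module) is not None
        for method, module in REQUIRED_MODULES.items()
    }


for method, ok in available_methods().items():
    print(f"{method}: {'available' if ok else 'module missing'}")
```

Note that the bs4 method may need an extra parser package as well if you select "lxml" or "html5lib" as the parser; this check does not cover that.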