html2text lynx: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte

thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.


Error: ERROR: Docs (https://istina.msu.ru/dissertation_councils/by_organization/214524/documents/)

    Traceback (most recent call last):
      File "/usr/lib/python3.7/site-packages/urlwatch/handler.py", line 91, in process
        data = FilterBase.process(filter_kind, subfilter, self, data)
      File "/usr/lib/python3.7/site-packages/urlwatch/filters.py", line 89, in process
        return filtercls(state.job, state).filter(data, subfilter)
      File "/usr/lib/python3.7/site-packages/urlwatch/filters.py", line 174, in filter
        return html2text(data, method=method, options=options)
      File "/usr/lib/python3.7/site-packages/urlwatch/html2txt.py", line 97, in html2text
        stdout = stdout.decode(stdout_encoding)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte
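
For context, here is a minimal sketch of how this kind of error can arise. It is not taken from urlwatch or lynx. In UTF-8, 0xd0 is the lead byte of a two-byte sequence (many Cyrillic letters start with it), and "invalid continuation byte" means the byte that follows is not in the range 0x80 to 0xBF. The sample word and the way the bytes get damaged are assumptions made up for illustration; what actually corrupts the lynx output in this issue is not known.

```python
# Hypothetical sample: a Cyrillic word whose first letter encodes as D0 94.
data = "Документы".encode("utf-8")

# Simulate a damaged stream: keep the lead byte 0xd0, but replace its
# continuation byte with a plain space (an assumption, for illustration only).
broken = data[:1] + b" " + data[2:]

try:
    broken.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    # Same class of error as in the traceback above.
    reason = exc.reason
    print(exc)

# A tolerant decode substitutes U+FFFD for the undecodable byte instead of raising.
tolerant = broken.decode("utf-8", errors="replace")
print(tolerant)
```

Decoding with `errors="replace"` hides the symptom rather than fixing the cause, so switching to another conversion method is still the cleaner workaround.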

Try a different html2text method. It looks like lynx has a problem with some characters (the website's encoding appears to be UTF-8, so the page itself seems fine). The available methods are:

Quoted from lib/urlwatch/html2txt.py:

    Method may be one of:
     'lynx'           - Use "lynx -dump" for conversion
                        options: see "lynx -help" output for options that work with "-dump"
     'html2text'      - Use "html2text -nobs" for conversion
                        options: https://linux.die.net/man/1/html2text
     'bs4'            - Use Beautiful Soup library to prettify the HTML
                        options: "parser" only, bs4 supports "lxml", "html5lib", and "html.parser"
                        https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
     're'             - A simple regex-based HTML tag stripper
     'pyhtml2text'    - Use Python module "html2text"
                        options: https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options

You can use urlwatch --test-filter <NUMBER> (where <NUMBER> is the index from urlwatch --list) to check the output.

It worked for me with at least the re, bs4 and pyhtml2text methods (you might need to install some dependencies).
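
To give an idea of what a "simple regex-based HTML tag stripper" like the re method does, here is a rough sketch. It is an illustration only, not urlwatch's actual code: the function name `strip_tags` and the sample HTML are made up. Because it works on an already decoded string in-process, no external program's output has to be decoded.

```python
import re


def strip_tags(html: str) -> str:
    """Remove anything that looks like an HTML tag and drop blank lines."""
    without_tags = re.sub(r"<[^>]*>", "", html)
    # Trim trailing whitespace, then keep only lines that still have content.
    lines = (line.rstrip() for line in without_tags.splitlines())
    return "\n".join(line for line in lines if line.strip())


# Made-up sample input containing Cyrillic text.
sample = "<html><body>\n<h1>Документы</h1>\n\n<p>Пример</p>\n</body></html>"
print(strip_tags(sample))
```

A stripper like this does not decode HTML entities or skip script and style content, so the bs4 or pyhtml2text methods may give cleaner output on complex pages.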

html2text lynx: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte #426