Open petRUShka opened 5 years ago
Try a different `html2text` method. It seems like lynx might have problems displaying some characters (the website encoding seems to be UTF-8, so that should be fine).
Quoted from `lib/urlwatch/html2txt.py`:
Method may be one of:
'lynx' - Use "lynx -dump" for conversion
options: see "lynx -help" output for options that work with "-dump"
'html2text' - Use "html2text -nobs" for conversion
options: https://linux.die.net/man/1/html2text
'bs4' - Use Beautiful Soup library to prettify the HTML
options: "parser" only, bs4 supports "lxml", "html5lib", and "html.parser"
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
're' - A simple regex-based HTML tag stripper
'pyhtml2text' - Use Python module "html2text"
options: https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options
You can use `urlwatch --test-filter <NUMBER>` (where `<NUMBER>` is the index from `urlwatch --list`) to check the output.
It worked for me with at least the `re`, `bs4` and `pyhtml2text` filters (you might need to install some dependencies).
urlwatch 2.17 fails on a specific page with the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 154: invalid continuation byte
Config:
Error:
ERROR: Docs (https://istina.msu.ru/dissertation_councils/by_organization/214524/documents/)
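To show what this class of error means, the sketch below reproduces an "invalid continuation byte" failure in isolation. It assumes, purely for illustration, that the byte stream being decoded has lost the second byte of a two-byte Cyrillic character; whether that is what happens inside urlwatch or lynx for this page is not confirmed here. The helper name `decode_or_explain` is hypothetical.

```python
def decode_or_explain(raw: bytes) -> str:
    """Decode UTF-8, or describe where and why decoding failed."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        bad = raw[exc.start]
        return f"{exc.reason} at position {exc.start} (byte {bad:#04x})"


good = "Документы".encode("utf-8")  # starts with b'\xd0\x94\xd0\xbe'
# Simulate damaged output: drop the continuation byte after the first 0xd0,
# so 0xd0 is followed by another lead byte instead of a continuation byte.
broken = good[:1] + good[2:]

if __name__ == "__main__":
    print(decode_or_explain(good))
    print(decode_or_explain(broken))
    # Lenient decoding substitutes U+FFFD for the damaged byte instead of raising.
    print(broken.decode("utf-8", errors="replace"))
```

In UTF-8, `0xd0` starts a two-byte sequence (most Cyrillic letters begin with `0xd0` or `0xd1`), so the decoder raises as soon as the next byte is not a continuation byte.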