thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

UnicodeDecodeError: 'charmap' codec can't decode byte #737

Closed (yuis-ice closed this issue 1 year ago)

yuis-ice commented 1 year ago

With the following urls.yaml, the filter test succeeds:

kind: url
name: rakuten card promo
url: https://www.rakuten-card.co.jp/campaign/add-card/
filter: 
  - css: "section[id='rule_detail']"
  - html2text:
    method: re
  - grep: "2,000"
> urlwatch --urls urls.yaml --test-filter 1
                2枚目の楽天カードを作成&利用特典…2,000ポイント(期間限定ポイント)

However, with the following urls.yaml, whose grep pattern contains non-ASCII (UTF-8 encoded) text, it fails with an error:

kind: url
name: rakuten card promo
url: https://www.rakuten-card.co.jp/campaign/add-card/
filter: 
  - css: "section[id='rule_detail']"
  - html2text:
    method: re
  - grep: "2,000ポイント"
> urlwatch --urls urls.yaml --test-filter 1
Traceback (most recent call last):
  File "C:\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\pg\urlwatch_dev\venv\Scripts\urlwatch.exe\__main__.py", line 7, in <module>
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\cli.py", line 108, in main
    urlwatch = Urlwatch(command_config, config_storage, cache_storage, urls_storage)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\main.py", line 66, in __init__
    self.load_jobs()
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\main.py", line 85, in load_jobs
    jobs = self.urls_storage.load_secure()
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\storage.py", line 316, in load_secure
    jobs = self.load()
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\storage.py", line 419, in load
    return self._parse(fp)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\storage.py", line 385, in _parse
    jobs = [JobBase.unserialize(job) for job in yaml.load_all(fp, Loader=yaml.SafeLoader)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\urlwatch\storage.py", line 385, in <listcomp>
    jobs = [JobBase.unserialize(job) for job in yaml.load_all(fp, Loader=yaml.SafeLoader)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\yaml\__init__.py", line 90, in load_all
    loader = Loader(stream)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\yaml\loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\yaml\reader.py", line 85, in __init__
    self.determine_encoding()
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\yaml\reader.py", line 124, in determine_encoding
    self.update_raw()
  File "c:\pg\urlwatch_dev\venv\lib\site-packages\yaml\reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "C:\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 214: character maps to <undefined>
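
The traceback suggests that urls.yaml was opened in text mode with the platform default encoding (cp1252 here) rather than UTF-8. As a minimal sketch of that reading, and not urlwatch's actual code, the snippet below decodes the grep pattern's UTF-8 bytes as cp1252 and then reads the same bytes back with an explicit `encoding="utf-8"`. The scratch file path is a hypothetical stand-in for the real urls.yaml.

```python
import os
import tempfile

# The grep pattern from the failing urls.yaml, encoded as UTF-8 bytes.
raw = "2,000ポイント".encode("utf-8")

# Decoding those bytes as cp1252 fails: the UTF-8 form of "ポ" is
# E3 83 9D, and byte 0x9d has no mapping in Python's cp1252 codec.
try:
    raw.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Reading the same bytes from a file with an explicit UTF-8 encoding works.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "urls.yaml")  # hypothetical scratch file
    with open(path, "wb") as fp:
        fp.write(b'  - grep: "' + raw + b'"\n')
    with open(path, encoding="utf-8") as fp:
        text = fp.read()

print(decode_error)
print("2,000ポイント" in text)
```

If this reading is right, the first urls.yaml only works because it contains nothing but ASCII, which decodes identically under cp1252 and UTF-8.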

Versions:

> python --version
Python 3.7.6

> urlwatch --version
urlwatch 2.25

OS: Windows 10, PowerShell
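
Since the failure appears to depend on the default text encoding of the environment, that value may also be worth reporting. The following is a small diagnostic sketch using only the standard library. It assumes Python 3.7 or later, where `sys.flags.utf8_mode` exists.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

# 1 when Python's UTF-8 mode is enabled (PYTHONUTF8=1 or -X utf8), else 0.
utf8_mode = sys.flags.utf8_mode

print(f"default text encoding: {default_encoding}")
print(f"UTF-8 mode enabled: {bool(utf8_mode)}")
```

On a Windows setup like the one above this would be expected to report cp1252. Enabling UTF-8 mode (for example by setting the `PYTHONUTF8` environment variable to `1` before running urlwatch) is one possible workaround, though it has not been verified against this report.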