thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

fixes UnicodeDecodeError for non English urls.yaml #738

Closed yuis-ice closed 1 year ago

yuis-ice commented 1 year ago

fixes #737

thp commented 1 year ago

While I get the intention, just catching any UnicodeDecodeError is wrong: it will also fail if the system already tries to decode as UTF-8, because the fallback would then simply decode as UTF-8 a second time.

The problem is that Python 3 on Windows 10 doesn't use UTF-8 by default. If you set the environment variable PYTHONUTF8=1 (documentation, PEP 540), then it should work properly.

...and as documented in PEP 540, this requires Python 3.7 or newer.
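
To make the encoding point concrete, here is a minimal sketch (not urlwatch code) that reports whether the interpreter is running in UTF-8 Mode and shows how UTF-8 bytes behave under a non-UTF-8 default. The helper names `describe_text_defaults` and `decodes_cleanly` are hypothetical, and `cp1252` is only an assumed example of a Windows code page; the reporter's actual code page is not stated in this thread.

```python
import locale
import sys


def describe_text_defaults():
    """Report the settings that decide how text files are decoded by default."""
    return {
        # True when started with PYTHONUTF8=1 or -X utf8 (Python 3.7+).
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding used by open() when no encoding= argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if raw can be decoded with the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 bytes for a short Japanese string, standing in for a non-English urls.yaml.
SAMPLE = "2枚目の楽天カード".encode("utf-8")

if __name__ == "__main__":
    print(describe_text_defaults())
    print("utf-8 :", decodes_cleanly(SAMPLE, "utf-8"))
    # cp1252 is an assumed stand-in for a legacy Windows default encoding.
    print("cp1252:", decodes_cleanly(SAMPLE, "cp1252"))
```

Running this once with and once without PYTHONUTF8=1 set should show `utf8_mode` and the preferred encoding change, which is the behaviour the environment variable is meant to control.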

thp commented 1 year ago

However, it might make sense to add this somewhere to the documentation so that it's easier to find for users.

yuis-ice commented 1 year ago

I understand that. For now, I don't have a better solution than catching UnicodeDecodeError.

thp commented 1 year ago

@yuis-ice Did you test if setting PYTHONUTF8=1 locally in your environment will fix the issue even without this code change?

yuis-ice commented 1 year ago

Yes, it works.

> $env:PYTHONUTF8 = '1'

> urlwatch --urls urls.yaml --test-filter 1
                2枚目の楽天カードを作成&利用特典…2,000ポイント(期間限定ポイント)

thp commented 1 year ago

UTF-8 Mode for Windows is now documented here: 3341d880ee7fc7b899abc788570fd0567473d365