ruyisdk / ruyi

RuyiSDK Package Manager
Apache License 2.0

fix: exception when news contains UTF-8 character #185

Closed RekiDunois closed 2 months ago

RekiDunois commented 3 months ago

When running ruyi news list, the open() call does not specify an encoding. If the content read contains non-ASCII (Unicode) characters, ruyi raises a UnicodeDecodeError:

Traceback (most recent call last):
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/__main__.py", line 53, in <module>
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/cli/__init__.py", line 319, in main
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/news_cli.py", line 42, in cli_news_list
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 462, in news_store
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 454, in ensure_news_cache
  File "encodings/ascii.py", line 26, in decode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 25: ordinal not in range(128)

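However, the problem does not always reproduce: it only appears with the compiled/released binary, not in the development environment.

Analysis:

  1. open() uses the value of locale.getpreferredencoding() as its default encoding. Printing it shows ANSI_X3.4-1968 when the command is run from the command line, but utf-8 when the .py file is run directly with python, which is why nothing goes wrong in the development environment.

  2. The environment variables LC_CTYPE and LC_ALL are both unset on this system, while LANG is set to en_US.UTF-8. With the first two unset, locale.getpreferredencoding() may end up returning ANSI_X3.4-1968.

  3. Explicitly passing encoding utf-8 to open() in ensure_news_cache() in repo.py avoids the problem.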
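
To illustrate point 3, here is a minimal sketch of the failure and the fix. It is not the actual ruyi code: the file name news.md and the sample text are made up for the demonstration, and encoding="ascii" is passed explicitly only to imitate what open() would choose by default when the locale encoding is ANSI_X3.4-1968.

```python
import os
import tempfile

# Hypothetical stand-in for a news file that contains non-ASCII text.
text = "RuyiSDK 支持展示新闻了\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news.md")
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Imitate the failing environment: "ascii" is what the default
    # encoding resolves to when the locale reports ANSI_X3.4-1968.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # The proposed fix: name the encoding explicitly, so the result
    # no longer depends on the locale of the calling process.
    with open(path, encoding="utf-8") as f:
        content = f.read()

print("ascii read failed:", ascii_failed)
print("utf-8 read matches:", content == text)
```

With the explicit encoding, the read succeeds regardless of LANG, LC_CTYPE, or LC_ALL.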

xen0n commented 2 months ago

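I can't reproduce this on my side. Locally I added debug output of locale.getpreferredencoding() to ruyi --version, and it shows UTF-8. My LANG is also en_US.UTF-8.

Could you provide more information about your runtime environment? I somewhat suspect a minimal environment such as a container where locale-gen has not been run, but I have not verified this myself.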
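
For anyone who wants to collect the same kind of debug information, a small sketch follows. The helper name locale_report is hypothetical and not part of ruyi; it only gathers values that are known to influence Python's default text encoding. One possible explanation for the difference between the two ways of running, which is not confirmed in this thread, is that the stock python executable applies C-locale coercion (PEP 538) while an embedded or compiled build may not.

```python
import locale
import os
import sys


def locale_report(environ=os.environ):
    """Collect the values that influence the default text encoding."""
    return {
        "LC_ALL": environ.get("LC_ALL"),
        "LC_CTYPE": environ.get("LC_CTYPE"),
        "LANG": environ.get("LANG"),
        # What open() falls back to when no encoding is given.
        "preferred_encoding": locale.getpreferredencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in locale_report().items():
        print(f"{key}={value!r}")
```

Running this both with the python interpreter and from inside the packaged binary would show whether the two really see different preferred encodings.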

RekiDunois commented 2 months ago

It's an Arch virtual machine installed in VMware:

⋊> reki@RekiArch ⋊> ~ neofetch                                                                                                 09:37:53
                   -`                    reki@RekiArch
                  .o+`                   -------------
                 `ooo/                   OS: Arch Linux x86_64
                `+oooo:                  Host: VMware20,1 None
               `+oooooo:                 Kernel: 6.10.3-arch1-1
               -+oooooo+:                Uptime: 3 mins
             `/:-:++oooo+:               Packages: 407 (pacman)
            `/++++/+++++++:              Shell: fish 3.7.1
           `/++++++++++++++:             Resolution: 1280x800
          `/+++ooooooooooooo/`           Terminal: /dev/pts/0
         ./ooosssso++osssssso+`          CPU: AMD Ryzen 7 5800X (8) @ 4.200GHz
        .oossssso-````/ossssss+`         GPU: 00:0f.0 VMware SVGA II Adapter
       -osssssso.      :ssssssso.        Memory: 306MiB / 7904MiB
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

⋊> reki@RekiArch ⋊> ~ locale -a                                                                                                09:37:55
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
C.utf8
POSIX
⋊> reki@RekiArch ⋊> ~ cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://gitlab.archlinux.org/groups/archlinux/-/issues"
PRIVACY_POLICY_URL="https://terms.archlinux.org/docs/privacy-policy/"
LOGO=archlinux-logo

However, /etc/locale.gen doesn't seem to have a single uncommented line:

...
#zh_CN.GB18030 GB18030
#zh_CN.GBK GBK
#zh_CN.UTF-8 UTF-8
#zh_CN GB2312
#zh_HK.UTF-8 UTF-8
#zh_HK BIG5-HKSCS
#zh_SG.UTF-8 UTF-8
#zh_SG.GBK GBK
#zh_SG GB2312
#zh_TW.EUC-TW EUC-TW
#zh_TW.UTF-8 UTF-8
#zh_TW BIG5
#zu_ZA.UTF-8 UTF-8
#zu_ZA ISO-8859-1
...
xen0n commented 2 months ago

With a freshly installed archlinux:latest container I can't reproduce it. In particular:

[root@0bdeeafdf888 /]# locale -a
C 
C.utf8
POSIX

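Note: there is no Cannot set LC_* to default locale: No such file or directory error.

My current suspicion is that locale-gen has never been run in your environment. Could you run it and try again? (When /etc/locale.gen does not explicitly enable any locale, locale-gen generates all locales supported by glibc.)

If that is confirmed to be the cause of the unexpected behavior, the reasonable fix would be to detect the problem and prompt the user to resolve it themselves.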
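
A rough sketch of what such a detection could look like is given below. The function names is_utf8_encoding and locale_warning and the wording of the message are assumptions made for illustration; nothing here is taken from ruyi itself.

```python
import codecs
import locale


def is_utf8_encoding(name):
    """Return True if the codec name is an alias for UTF-8."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown codec names are treated as "not UTF-8".
        return False


def locale_warning(encoding=None):
    """Return a warning string if the locale encoding is not UTF-8, else None."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    if is_utf8_encoding(encoding):
        return None
    return (
        f"warning: the locale encoding is {encoding}, not UTF-8; "
        "check LANG / LC_CTYPE / LC_ALL and the output of `locale -a`"
    )


if __name__ == "__main__":
    message = locale_warning()
    if message is not None:
        print(message)
```

The check compares resolved codec names rather than raw strings, so spellings such as utf8, UTF-8, and utf_8 are all accepted.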

RekiDunois commented 2 months ago

I tried running locale-gen, and the problem is still there:

⋊> reki@RekiArch ⋊> /u/l/locale sudo locale-gen                                                                  11:01:47
Generating locales...
  zh_CN.UTF-8... done
Generation complete.
⋊> reki@RekiArch ⋊> /u/l/locale ls                                                                               11:01:54
drwxr-xr-x root root 4.0 KB Mon Aug  5 15:21:03 2024  C.utf8
.rw-r--r-- root root 3.1 MB Thu Sep 19 11:01:54 2024  locale-archive
⋊> reki@RekiArch ⋊> /u/l/locale ruyi news list                                                                   11:01:55
Traceback (most recent call last):
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/__main__.py", line 53, in <module>
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/cli/__init__.py", line 319, in main
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/news_cli.py", line 42, in cli_news_list
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 462, in news_store
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 454, in ensure_news_cache
  File "encodings/ascii.py", line 26, in decode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 25: ordinal not in range(128)
RekiDunois commented 2 months ago

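I experimented a bit: as long as LC_CTYPE is set to a UTF-8 entry listed by locale -a, the file is decoded correctly. If it is set to anything else or left empty, the exception is raised. So in theory, after catching this error, checking the value of LC_CTYPE or reporting the output of locale -a would be enough to tell users whether they have hit the same problem.

Related documentation: https://docs.python.org/3/library/locale.html#locale.getpreferredencoding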

⋊> reki@RekiArch ⋊> ~ set -gx LC_CTYPE                                                                           11:19:40
⋊> reki@RekiArch ⋊> ~ ruyi news list                                                                             11:19:51
Traceback (most recent call last):
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/__main__.py", line 53, in <module>
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/cli/__init__.py", line 319, in main
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/news_cli.py", line 42, in cli_news_list
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 462, in news_store
  File "/home/reki/.cache/ruyi/progcache/0.16.0/x86_64/ruyi/ruyipkg/repo.py", line 454, in ensure_news_cache
  File "encodings/ascii.py", line 26, in decode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 25: ordinal not in range(128)
⋊> reki@RekiArch ⋊> ~ set -gx LC_CTYPE zh_CN.utf8                                                                11:19:53
⋊> reki@RekiArch ⋊> ~ ruyi news list                                                                             11:19:57
News items:

 No.   ID                                  Title
────────────────────────────────────────────────────────────────────────────────────────────
 1     2024-01-14-ruyi-news                RuyiSDK 支持展示新闻了
 2     2024-01-15-new-board-images         新增板卡支持 (2024-01-15)
 3     2024-01-29-new-board-images         新增板卡支持 (2024-01-29)
 4     2024-01-29-ruyi-0.4                 RuyiSDK 0.4 版本更新说明
 5     2024-02-26-gnu-plct-rv64ilp32-elf   RV64ILP32 裸机工具链与 profile 现已可用
 6     2024-04-23-ruyi-0.9                 RuyiSDK 0.9 版本更新说明
 7     2024-05-14-ruyi-0.10                RuyiSDK 0.10 版本更新说明
 8     2024-05-28-ruyi-0.11                RuyiSDK 0.11 版本更新说明
 9     2024-06-11-ruyi-0.12                RuyiSDK 0.12 版本更新说明
 10    2024-06-24-ruyi-0.13                RuyiSDK 0.13 版本更新说明
 11    2024-07-08-box64-wps-office-poc     尝鲜:使用 Box64 在 RISC-V 系统上运行 WPS Office
 12    2024-07-09-ruyi-0.14                RuyiSDK 0.14 版本更新说明
 13    2024-07-23-ruyi-0.15                RuyiSDK 0.15 版本更新说明
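
The catch-and-report idea described at the top of this comment could be sketched roughly as follows. The helper read_text_with_hint, the file name, and the hint text are hypothetical; the demo forces encoding="ascii" to stand in for a process whose locale encoding is ASCII.

```python
import os
import tempfile


def read_text_with_hint(path, encoding=None, environ=os.environ):
    """Read a text file; on a decode failure, re-raise with locale context."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        settings = ", ".join(
            f"{name}={environ.get(name)!r}"
            for name in ("LC_ALL", "LC_CTYPE", "LANG")
        )
        raise RuntimeError(
            f"cannot decode {path} ({exc.reason}); locale settings: {settings}. "
            "Consider setting LC_CTYPE to a UTF-8 locale listed by `locale -a`."
        ) from exc


# Demo: a UTF-8 file read with a forced ASCII decoder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "news.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("新闻")
    try:
        read_text_with_hint(path, encoding="ascii", environ={"LANG": ""})
        message = None
    except RuntimeError as exc:
        message = str(exc)

print(message)
```

The original UnicodeDecodeError stays attached as the cause, so the full traceback is still available for debugging.
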
xen0n commented 2 months ago

Cannot reproduce.

Inside an archlinux:latest container:

# pacman -Sy fish python && fish
[...]
root@3d18c58e7ac6 /# set -gx LC_CTYPE
root@3d18c58e7ac6 /# python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
root@3d18c58e7ac6 /# locale
LANG=C.UTF-8
LC_CTYPE=
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

What does locale print in your environment?

RekiDunois commented 2 months ago

I tried again, this time changing the environment variables in another, working virtual machine, with the following result:

23:19 reki@HyperVArch ~ ./rw
$ set -gx LANG
23:19 reki@HyperVArch ~ ./rw
$ locale
LANG=
LC_CTYPE=
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES=C.UTF-8
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
23:19 reki@HyperVArch ~ ./rw
$ ruyi news list
Traceback (most recent call last):
  File "/home/reki/.cache/ruyi/progcache/0.15.0/x86_64/__main__.py", line 53, in <module>
  File "/home/reki/.cache/ruyi/progcache/0.15.0/x86_64/ruyi/cli/__init__.py", line 319, in main
  File "/home/reki/.cache/ruyi/progcache/0.15.0/x86_64/ruyi/ruyipkg/news_cli.py", line 42, in cli_news_list
  File "/home/reki/.cache/ruyi/progcache/0.15.0/x86_64/ruyi/ruyipkg/repo.py", line 462, in news_store
  File "/home/reki/.cache/ruyi/progcache/0.15.0/x86_64/ruyi/ruyipkg/repo.py", line 454, in ensure_news_cache
  File "encodings/ascii.py", line 26, in decode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 22: ordinal not in range(128)

Changing it like this restores normal behavior:

23:19 reki@HyperVArch ~ ./rw
$ set -gx LANG C.UTF-8                                                                                                ↵ 1
23:20 reki@HyperVArch ~ ./rw
$ locale                                                                                                              ↵ 1
LANG=C.UTF-8
LC_CTYPE=
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES=C.UTF-8
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
23:20 reki@HyperVArch ~ ./rw
$ ruyi news list
News items:

 No.   ID                                  Title
──────────────────────────────────────────────────────────────────────────────────────────────────
 1     2024-01-14-ruyi-news                RuyiSDK now supports displaying news
 2     2024-01-15-new-board-images         New board images available (2024-01-15)
 3     2024-01-29-new-board-images         New board images available (2024-01-29)
 4     2024-01-29-ruyi-0.4                 Release notes for RuyiSDK 0.4
 5     2024-02-26-gnu-plct-rv64ilp32-elf   RV64ILP32 bare-metal toolchain & profile now available
 6     2024-04-23-ruyi-0.9                 Release notes for RuyiSDK 0.9
 7     2024-05-14-ruyi-0.10                Release notes for RuyiSDK 0.10
 8     2024-05-28-ruyi-0.11                Release notes for RuyiSDK 0.11
 9     2024-06-11-ruyi-0.12                Release notes for RuyiSDK 0.12
 10    2024-06-24-ruyi-0.13                Release notes for RuyiSDK 0.13
 11    2024-07-08-box64-wps-office-poc     尝鲜:使用 Box64 在 RISC-V 系统上运行 WPS Office
 12    2024-07-09-ruyi-0.14                Release notes for RuyiSDK 0.14
 13    2024-07-23-ruyi-0.15                Release notes for RuyiSDK 0.15

It seems the problem only appears when neither LANG nor LC_CTYPE is set to a UTF-8 locale that locale -a lists as supported.
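
This observation is consistent with the usual POSIX lookup order for the character-type locale: LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset and the C locale as the final fallback. A small sketch of that lookup follows. The function names are made up for illustration, and the check only inspects the locale name, so it cannot tell whether the named locale has actually been generated on the system.

```python
def effective_ctype(environ):
    """Resolve the locale name used for LC_CTYPE from environment variables."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # empty strings count as unset
            return value
    return "C"


def looks_utf8(locale_name):
    """Check whether the codeset part of a locale name spells UTF-8."""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


# The two situations shown in the transcripts above.
broken = {"LANG": "", "LC_CTYPE": ""}
fixed = {"LANG": "C.UTF-8", "LC_CTYPE": ""}

print(effective_ctype(broken), looks_utf8(effective_ctype(broken)))
print(effective_ctype(fixed), looks_utf8(effective_ctype(fixed)))
```

A name that passes this check but refers to a locale missing from locale -a, as with en_US.UTF-8 in the first VM, would still fall back to the C locale at runtime.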

RekiDunois commented 2 months ago

Will this PR still be merged, or does it need to be reworked to check these two variables and warn the user instead?

xen0n commented 2 months ago

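> Will this PR still be merged, or does it need to be reworked to check these two variables and warn the user instead?

I submitted #185, which specifies utf-8 encoding for every file opened in text mode. Since external conditions such as the user's terminal really may not be UTF-8, it is probably not appropriate to set a UTF-8 locale on the user's behalf at program startup (and doing so might not succeed anyway). Warning the user to fix their configuration when an unreasonable locale setup is detected is a separate matter and should be handled apart from the current issue.

We are still deciding how to accept external contributions (we asked a previous external contributor to add a DCO-style Signed-off-by line to the commit message, but we may not keep using DCO in the future). Because the commit message of this PR carries no DCO sign-off, we will not merge this commit as-is. If you would like this PR merged, please at least add a Signed-off-by trailer to the commit message, following https://developercertificate.org/. I will then rebase #202 onto your commit and merge.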

RekiDunois commented 2 months ago

> If you would like this PR merged, please at least add a Signed-off-by trailer to the commit message

Added.