rpm-software-management / dnf5

Next-generation RPM package management system
Other
250 stars 86 forks source link

dnf5 CLI does not handle emojis (and potentially other unicode symbols?) in %description #1685

Closed gotmax23 closed 1 month ago

gotmax23 commented 1 month ago

The uv package's description in Fedora has emojis in it, which dnf5 does not handle correctly. See the difference between dnf5 repoquery and dnf5 info's output:

sudo dnf5 repoquery --qf='%{description}' uv properly renders emojis

dnf5 info does not

$ dnf5 --version
dnf5 version 5.1.17
dnf5 plugin API version 1.0
libdnf5 version 5.1.17
libdnf5 plugin API version 1.0

Loaded dnf5 plugins:
  name: builddep
  version: 1.0.0
  API version: 1.0

  name: changelog
  version: 1.0.0
  API version: 1.0

  name: config-manager
  version: 0.1.0
  API version: 1.0

  name: copr
  version: 0.1.0
  API version: 1.0

  name: needs_restarting
  version: 1.0.0
  API version: 1.0

  name: repoclosure
  version: 1.0.0
  API version: 1.0
hroncok commented 1 month ago

@musicinmybrain said in the Fedora Python matrix channel:

I am not deeply committed to the emoji in the description of uv in particular, although all else equal, I default to preserving them since they are a very intentional part of upstream’s writing. music I tend to fall on the side of thinking that emoji are part of Unicode support, and (subject to font availability) failure to display them is a bug. Emoji are not the only things that use codepoints outside the BMP or combine several codepoints into one glyph, but they are the most visible to English speakers. So I tend to assume that if emoji don’t work, something in CJK languages doesn’t work either.

musicinmybrain commented 1 month ago

I just tried this on Fedora 40:

dnf5 info python3-fastapi

and got (excerpted):

               : The key features are:
               : 
               :   \xe2\x80\xa2 Fast: Very high performance, on par with NodeJS and Go (thanks to Starlette
               :     and Pydantic). One of the fastest Python frameworks available.

Now, these are merely Unicode bullets, ! So it looks like dnf5 is treating everything as ASCII and backslash-escaping everything that isn’t ASCII.

The above works fine with dnf 4.21.0 and with rpm -qi.

musicinmybrain commented 1 month ago

There are also differences in localization, which kind of worked in rpm, kind of worked differently in dnf 4, and doesn’t work at all in dnf5. You can compare these in Fedora 40:

(Don’t try other locales for now; I just fixed an issue with the other localized descriptions in the python-fastapi spec file, and the updates for this are not yet in stable repositories.)

I guess this should be filed as a separate issue.

jrohel commented 1 month ago

dnf5 version 5.1.17 does not use locale. Support was added to the upstream some time ago. I assume the problem was solved by this commit https://github.com/rpm-software-management/dnf5/commit/3d302e0b32174eab40176befddb26628f283757a . That is, since dnf5 version 5.2.0.

musicinmybrain commented 1 month ago

Hmm, thanks for that! It does look like this works in Fedora 41. Both the Unicode bullets from python3-fastapi and the emoji from uv are displayed correctly when I try dnf info in a Fedora 41 mock chroot with --enable-network, currently with dnf5 version 5.2.5.0. I had no idea the “preview” version of dnf5 packaged in F39 and F40 was significantly different from the one in F41+.

musicinmybrain commented 1 month ago

It turns out, however, that setting LC_ALL – even to en_US.UTF-8 – seems to break this.

https://github.com/rpm-software-management/dnf5/issues/1687#issuecomment-2360611644

jrohel commented 1 month ago

It turns out, however, that setting LC_ALL – even to en_US.UTF-8 – seems to break this.

Is it a dnf problem? Can you please take a look at https://github.com/rpm-software-management/dnf5/issues/1687#issuecomment-2360779811

musicinmybrain commented 1 month ago

It turns out, however, that setting LC_ALL – even to en_US.UTF-8 – seems to break this.

Is it a dnf problem? Can you please take a look at #1687 (comment)

Indeed, as described in https://github.com/rpm-software-management/dnf5/issues/1687#issuecomment-2361046175, there are two things going on:

So I think that at least with respect to this bug, dnf 5.2.x itself as shipped in Fedora 41 may be behaving as one would expect, as when setlocale fails dnf doesn’t necessarily have access to information about whether the expected output encoding is UTF-8, so the backslash-escaping behavior might be considered reasonable.

musicinmybrain commented 1 month ago

(Deleted a comment that was inaccurate because I did an experiment in a mock chroot that I had forgotten to clean first.)

jrohel commented 1 month ago

We can try to improve it. If the locale setting fails, then instead of the defaulting to "C" message, we can try setting "C.UTF-8". If this succeeds the message defaulting to "C.UTF-8" will be displayed and only if it fails then defaulting to "C". I think dnf4 does it this way.

gotmax23 commented 1 month ago

dnf5 version 5.1.17 does not use locale. Support was added to the upstream some time ago. I assume the problem was solved by this commit 3d302e0 . That is, since dnf5 version 5.2.0.

Can this be backported to the dnf5 version in Fedora 40 as well?

musicinmybrain commented 1 month ago

We can try to improve it. If the locale setting fails, then instead of the defaulting to "C" message, we can try setting "C.UTF-8". If this succeeds the message defaulting to "C.UTF-8" will be displayed and only if it fails then defaulting to "C". I think dnf4 does it this way.

If there aren’t any pitfalls I’m overlooking, this seems like a good idea and a solid improvement.

ppisar commented 1 month ago

I don't think that falling to C.UTF-8 is great idea. If a user has a terminal in non-UTF-8 mode, he asks for e.g. ja_JP.EUCJP end DNF5 starts spitting out UTF-8 characters, it wil bodge his terminal. If glibc reverted a default locale from C.UTF-8 back to C, there was probably a good reason for it.

I would rather use nl_langfinfo() to retrieve current locale code set and print that in the error message. Or rather not mentioning "C" locale at all.

jrohel commented 1 month ago

@ppisar I think C.UTF-8 was created just for cases where the desired locale is not available. And we assume that a modern system uses UTF-8 as the default. https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

dnf4 is also trying a fallback to C.UTF-8 and so far this has proven to be a good solution. Yes there is a risk that the user will have a misconfigured system. He will have a locale available that the terminal will not know. The setlocale() will pass and the terminal will break even without this fallback. Or the user will request an uninstalled/unsupported locale on a terminal without UTF-8 support and then it will break because of this fallback.

Or the user will be a root and type rm -rf / and the whole system will crash. I'm just considering where the line is between user-friendliness and protection from shooting yourself in the foot.

However, you could try tweaking the fallback in your PR to set the C++ locale. Maybe based on the nl_langfinfo() you mention.

jrohel commented 1 month ago

Can this be backported to the dnf5 version in Fedora 40 as well?

I'm not sure. The current plan is to backport anything related to dnf5 to Fedora 40 or lower only if there is a strong reason to do so.