rustcoreutils / posixutils-rs

Core POSIX command line utilities in safe Rust
MIT License

Localization (i18n): notes and planning #72

Open jgarzik opened 5 months ago

jgarzik commented 5 months ago

Introduction

Soliciting discussion over the localization (i18n) strategy for this project.

Goals

Goal 1: Localize everything

The goal is complete localization of all messages visible to the user, within the bounds of POSIX compliance:

Goal 2: Encourage UTF-8

To be forward-looking, this project should interpret the POSIX standards aggressively in favor of UTF-8 support, and look for opportunities to create default-UTF-8 operating modes, with a fallback mode that is "POSIX-ly correct."

Implementation strategies

Current strategy

The current strategies are,

Improvements to our i18n

At present, OS error messages and --help are not translated at all, and need a project-wide strategy.

Also, one idea that is aligned with the gencat util is to generate catgets message catalogs and abandon gettext. This works because catgets exists on all modern platforms.

See issue #65 for util-related tasks.

Feedback and thoughts are requested. We want to give users the best i18n support possible.

kellpossible commented 1 week ago

posixutils-rs Localization Proposal

Requirements after offline discussion with @jgarzik :

  1. Prefer a single strings output file, or, a single output file per supported language
  2. Need to have a strategy for
    1. app strings
    2. OS error strings: all the errno error codes. e.g. "no such file or directory" upon open.
    3. --help strings.
  3. It would be nice if app and help strings are embedded within each util's source code, and extracted with cargo xtr a la gettext crate, but not required.
  4. Must use POSIX-standard environment variables such as NLSPATH (described in each man page) or LC_COLLATE (sort order)
  5. Use a single message file for posixutils as a whole. One file per util creates a distribution nightmare, and it also eliminates the strong possibility of sharing strings and translations across utils.
  6. Must not impact the normal build + test process in a large and negative way (e.g. slowing down each build by 10 minutes would be negative). Developer productivity and developer throughput are goals.
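
Requirement 4 implies a lookup order among the locale variables. As a rough sketch, assuming only the standard POSIX precedence (LC_ALL overrides LC_MESSAGES, which overrides LANG, and empty values count as unset), selecting the message locale might look like this. The function name `message_locale` and the "C" fallback are illustrative choices, not existing posixutils-rs code.

```rust
/// Pick the locale name that governs messages.
/// `lookup` abstracts over the environment so the logic can be tested
/// without touching the real process environment.
fn message_locale<F>(lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    // POSIX precedence: LC_ALL first, then the category variable, then LANG.
    for name in ["LC_ALL", "LC_MESSAGES", "LANG"] {
        if let Some(value) = lookup(name) {
            // A variable that is set but empty is treated as unset.
            if !value.is_empty() {
                return value;
            }
        }
    }
    // Assumption: fall back to the "C" locale when nothing is set.
    "C".to_string()
}
```

In a util this would be called as `message_locale(|name| std::env::var(name).ok())`.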

BOTH of the following implementation strategies are valid:
(1) extract strings from each .rs source file
(2) the "IBM approach": assign unique numbers to each and every error message, whether app-specific or generic, and maintain a posixutils global set of strings (and their translations)
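
To make strategy (2) concrete, here is a minimal sketch of a global numbered message table. It borrows the (set number, message number, default string) addressing that catgets uses, but it is plain Rust and does not read real catalog files. All names (`Catalog`, `SET_GENERIC`, `MSG_NO_SUCH_FILE`, `french_catalog`) and the sample French strings are hypothetical.

```rust
use std::collections::HashMap;

/// Project-wide message numbers; every util would share this list.
const SET_GENERIC: u32 = 1;
const MSG_NO_SUCH_FILE: u32 = 1;
const MSG_PERMISSION_DENIED: u32 = 2;

/// Translations for one language, keyed by (set number, message number).
struct Catalog {
    messages: HashMap<(u32, u32), String>,
}

impl Catalog {
    /// Return the translated message, or `default` when the catalog has
    /// no entry, which is the same fallback behavior catgets provides.
    fn get<'a>(&'a self, set: u32, num: u32, default: &'a str) -> &'a str {
        self.messages
            .get(&(set, num))
            .map(String::as_str)
            .unwrap_or(default)
    }
}

/// Sample data only; a real catalog would be loaded from disk.
fn french_catalog() -> Catalog {
    let mut messages = HashMap::new();
    messages.insert(
        (SET_GENERIC, MSG_NO_SUCH_FILE),
        "Aucun fichier ou dossier de ce type".to_string(),
    );
    Catalog { messages }
}
```

Because the English default travels with every call site, a missing or incomplete catalog degrades to untranslated output rather than an error.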

Considerations/Research

2b) OS Error Strings

According to https://stackoverflow.com/questions/43019882/does-libc-show-international-error-messages it seems like we should be able to use https://docs.rs/libc/latest/libc/fn.strerror.html to gain access to the system-provided localized messages for libc errno codes. However, there may be safety concerns with this function that warrant more investigation if we use it: https://users.rust-lang.org/t/unsafe-and-strerror-impossible-to-fix/90804
Most likely posixutils-rs will prefer Rust standard library functions over libc. Perhaps we can use https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.raw_os_error

It seems like it should be possible: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=52ae229dd19b14298c25d488516d3750 . Here is the output using the French locale installed on my system:

[image: program output in the French locale]

It seems like libc::setlocale() needs to be called manually using the contents of the relevant environment variables; libc does not detect the locale from those variables automatically.

Interestingly, the std::io::Error implementation appears to defer to libc's strerror family (strerror_r() on Unix) for its output, so there is no need for us to call libc::strerror() manually. We can simply use the std::fmt::Display output, provided we configure libc using libc::setlocale(). It almost feels like an oversight on the part of Rust’s standard library not to have this functionality enabled by default on platforms that support it.
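
A minimal sketch of that approach follows; it is not the playground code linked above. To stay free of external crates it declares setlocale by hand and hard-codes the LC_ALL value for Linux and macOS only, which is an assumption; real code would more likely use `libc::setlocale` and `libc::LC_ALL` from the libc crate. The helper names `init_locale_from_env` and `describe_errno` are hypothetical, and the `unsafe extern` block syntax assumes a recent Rust toolchain.

```rust
use std::ffi::{c_char, c_int, CStr};
use std::io;

// Assumed <locale.h> values of LC_ALL; only these two platforms are covered.
#[cfg(target_os = "linux")]
const LC_ALL: c_int = 6;
#[cfg(target_os = "macos")]
const LC_ALL: c_int = 0;

unsafe extern "C" {
    // Hand-written declaration of the C library's setlocale().
    fn setlocale(category: c_int, locale: *const c_char) -> *mut c_char;
}

/// Ask libc to adopt the locale described by the environment.
/// Returns the locale name libc selected, or None if the request failed
/// (for example, because the requested locale is not installed).
fn init_locale_from_env() -> Option<String> {
    // An empty locale string tells libc to consult LC_ALL, LC_*, and LANG.
    let empty = b"\0";
    let ptr = unsafe { setlocale(LC_ALL, empty.as_ptr().cast()) };
    if ptr.is_null() {
        return None;
    }
    // Copy the name out immediately; libc owns the returned buffer.
    Some(unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned())
}

/// Format an errno value through std::io::Error's Display implementation.
fn describe_errno(code: i32) -> String {
    io::Error::from_raw_os_error(code).to_string()
}
```

Calling `init_locale_from_env()` once at the start of main(), before any other threads exist, and then printing `describe_errno(2)` should show the ENOENT text in whatever language the system's libc catalogs provide for the selected locale.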

2c) --help messages

With the exception of m4, all the binaries in posixutils-rs currently use clap’s derive macro to generate help messages. In https://docs.rs/clap/latest/clap/_derive/index.html#command-attributes the about attribute accepts an expression, about [= <expr>], which presumably gets passed to https://docs.rs/clap/latest/clap/struct.Command.html#method.about. To use this with localized messages, the messages would need to be available with a 'static lifetime. To get around this we could use some kind of static thread-local or global cell containing a mutex, which is used to load the appropriate locale at the start of main() based on the current system settings, before executing clap.

Another issue is that using the about attribute disables the parsing of help messages from the doc attribute provided by Rust’s documentation comments. If we want these struct fields documented in the standard way for Rust, we will end up with duplicated text. We could probably get around this by creating a custom derive macro that wraps clap’s, uses the documentation comments for these fields in the localization system, and also provides the necessary about attribute, which would satisfy the optional requirement 3.

This raises a question about the source of truth: if each application shares messages in a single registry as per requirement 5, then we may end up with duplicates that need to be detected. Perhaps it is better to refer to localizations only by a message id and keep them separate from code documentation comments; this also makes the implementation a lot simpler. Whatever we decide to do here should follow on from the more general decision about how to localize strings in the application for requirement 2a.
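
As a rough illustration of the global-cell idea combined with lookup by message id, the sketch below uses `std::sync::OnceLock` from the standard library in place of a mutex-guarded cell, since the catalog is written once at startup and only read afterwards. The names `CATALOG`, `load_catalog`, `init_messages`, and `msg`, the message id `cat-about`, and the sample strings are all hypothetical, and the clap wiring is left out.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Process-wide message catalog, filled once at the start of main().
static CATALOG: OnceLock<HashMap<&'static str, String>> = OnceLock::new();

/// Stand-in for reading a catalog from disk; returns sample data only.
fn load_catalog(lang: &str) -> HashMap<&'static str, String> {
    let mut messages = HashMap::new();
    let about = match lang {
        "fr" => "concaténer et afficher des fichiers",
        _ => "concatenate and print files",
    };
    messages.insert("cat-about", about.to_string());
    messages
}

/// Load the catalog for `lang`. Later calls are ignored, so the first
/// locale chosen stays in effect for the life of the process.
fn init_messages(lang: &str) {
    let _ = CATALOG.set(load_catalog(lang));
}

/// Look up a message by id. The returned reference is 'static because
/// it borrows from the static cell. Unknown ids fall back to the id.
fn msg(id: &'static str) -> &'static str {
    CATALOG
        .get()
        .and_then(|catalog| catalog.get(id))
        .map(String::as_str)
        .unwrap_or(id)
}
```

With `init_messages(...)` called first thing in main(), an expression such as `msg("cat-about")` yields a `&'static str` that could then be handed to whatever help-text hook is chosen.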

Message Format

While an obvious choice for message format would be GNU gettext, some arguably better and more modern alternatives exist. Fluent makes a good case for the places where its design differs from gettext: https://github.com/projectfluent/fluent/wiki/Fluent-vs-gettext . In summary of this article:

Further comparisons between systems and crate implementations:

Proposal

This proposal is that we definitely use the fluent localization system instead of gettext. For a minimal setup it could potentially even have a lower overhead, and it has none of the licensing concerns that LGPL gettext raises for systems that must build it statically. It seems like an obvious choice after considering the tradeoffs. If localization is to be taken seriously, the features that fluent provides over simpler ad-hoc systems with basic message formatting are very important.

The next decision is what to use for the scaffolding around fluent. Messages must be loaded from disk, bundles must be configured according to the user’s requested locale, and ideally some form of static checking should be employed to help prevent mundane runtime errors. If keeping dependencies to a bare minimum is a high priority, then we could gradually implement this ourselves from scratch. If, however, there is a desire to share this functionality with the Rust community at large, then I’d propose to use i18n-embed-fl and cargo-i18n and upstream any changes required to make them fit the requirements of this project. I’m the maintainer of those projects, so I’d be very happy to take on this responsibility if that’s the direction we decide to go with.