posixutils-rs Localization Proposal

Requirements after offline discussion with @jgarzik:

1. Prefer a single strings output file, or a single output file per supported language.
2. Need to have a strategy for:
   a) app strings
   b) OS error strings: all the errno error codes, e.g. "no such file or directory" upon open.
   c) --help strings
3. It would be nice, but is not required, if app and help strings are embedded within each util's source code and extracted with cargo xtr a la gettext crate.
4. Must use POSIX-standard environment variables such as NLSPATH (described in each man page) or LC_COLLATE (sort order).
5. A single message file for posixutils as a whole. One file per util creates a distribution nightmare, and also eliminates the strong possibility of sharing strings and translations across utils.
6. Must not impact the normal build + test process in a large and negative way (e.g. slowing down each build by 10 minutes would be negative). Developer productivity and developer throughput is a goal.
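To make the environment-variable requirement concrete, here is a minimal sketch of how a message file could be located. It assumes the usual POSIX precedence for the message locale (LC_ALL, then LC_MESSAGES, then LANG) and the NLSPATH conversion specifications %N, %L, %l, %t, %c and %%. The function names and the catalog name "posixutils" are hypothetical, and the sketch treats every empty template as %N, which is a simplification.

```rust
use std::env;

/// Pick the message locale: LC_ALL, then LC_MESSAGES, then LANG, else "C".
/// Empty values are treated as unset.
fn messages_locale(get: impl Fn(&str) -> Option<String>) -> String {
    for name in ["LC_ALL", "LC_MESSAGES", "LANG"].iter() {
        if let Some(value) = get(*name) {
            if !value.is_empty() {
                return value;
            }
        }
    }
    "C".to_string()
}

/// Expand one NLSPATH template for a catalog `name` and a `locale`.
fn expand_template(template: &str, name: &str, locale: &str) -> String {
    // Split "language[_territory][.codeset][@modifier]" into its parts.
    let base = locale.split('@').next().unwrap_or("");
    let (lang_terr, codeset) = match base.split_once('.') {
        Some((lt, cs)) => (lt, cs),
        None => (base, ""),
    };
    let (language, territory) = match lang_terr.split_once('_') {
        Some((l, t)) => (l, t),
        None => (lang_terr, ""),
    };

    let mut out = String::new();
    let mut chars = template.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('N') => out.push_str(name),
            Some('L') => out.push_str(locale),
            Some('l') => out.push_str(language),
            Some('t') => out.push_str(territory),
            Some('c') => out.push_str(codeset),
            Some('%') => out.push('%'),
            // Unknown conversions are passed through unchanged in this sketch.
            Some(other) => {
                out.push('%');
                out.push(other);
            }
            None => out.push('%'),
        }
    }
    out
}

/// Expand every colon-separated template in an NLSPATH value.
fn candidate_paths(nlspath: &str, name: &str, locale: &str) -> Vec<String> {
    nlspath
        .split(':')
        .map(|t| if t.is_empty() { "%N" } else { t })
        .map(|t| expand_template(t, name, locale))
        .collect()
}

fn main() {
    let locale = messages_locale(|key| env::var(key).ok());
    let nlspath = env::var("NLSPATH").unwrap_or_default();
    for path in candidate_paths(&nlspath, "posixutils", &locale) {
        println!("{}", path);
    }
}
```

With a single message file for the whole project, each util would pass the same catalog name, so one NLSPATH entry would serve every binary.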

BOTH of the following implementation strategies are valid:
(1) extract strings from each .rs source file
(2) the "IBM approach": assign unique numbers to each and every error message, whether app-specific or generic, and maintain a posixutils global set of strings (and their translations)
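As a rough illustration of strategy (2), the sketch below gives every message a unique number and looks it up in a project-wide table, falling back to the source language when a translation is missing. The MsgId enum, the Catalog type, the numbering, and the French strings are all hypothetical and only meant to show the shape of the approach.

```rust
use std::collections::HashMap;

/// Hypothetical project-wide message numbers, shared by all utils.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
enum MsgId {
    NoSuchFile = 1,
    PermissionDenied = 2,
    UsageHeader = 100,
}

/// One global set of strings plus the translations for the active locale.
struct Catalog {
    fallback: HashMap<u32, &'static str>,
    translated: HashMap<u32, &'static str>,
}

impl Catalog {
    /// Look up a message by number, preferring the translation.
    fn get(&self, id: MsgId) -> &'static str {
        let number = id as u32;
        self.translated
            .get(&number)
            .or_else(|| self.fallback.get(&number))
            .copied()
            .unwrap_or("<missing message>")
    }
}

/// Illustrative data only; a real catalog would be loaded from the message file.
fn demo_catalog() -> Catalog {
    Catalog {
        fallback: HashMap::from([
            (1, "No such file or directory"),
            (2, "Permission denied"),
            (100, "Usage:"),
        ]),
        translated: HashMap::from([
            (1, "Aucun fichier ou dossier de ce type"),
            (2, "Permission non accordée"),
        ]),
    }
}

fn main() {
    let catalog = demo_catalog();
    println!("{}", catalog.get(MsgId::NoSuchFile));
    println!("{}", catalog.get(MsgId::PermissionDenied));
    // No translation exists for this number, so the fallback is used.
    println!("{}", catalog.get(MsgId::UsageHeader));
}
```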

Considerations/Research

2b) OS Error Strings

According to https://stackoverflow.com/questions/43019882/does-libc-show-international-error-messages it seems like we should be able to use https://docs.rs/libc/latest/libc/fn.strerror.html to gain access to the system-provided localized messages for libc errno codes. However, it seems like there might be some safety concerns with this function that will warrant more investigation if we use it: https://users.rust-lang.org/t/unsafe-and-strerror-impossible-to-fix/90804
Most likely posixutils-rs will prefer Rust standard library functions over libc. Perhaps we can use https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.raw_os_error

It seems like it should be possible: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=52ae229dd19b14298c25d488516d3750 . Here is the output using the French locale installed on my system:

It seems like libc::setlocale() needs to be called manually using the contents of the relevant environment variables; the locale for libc is not automatically detected from the environment.

Interestingly, the std::io::Error implementation appears to defer to libc::strerror() for its output, so there will be no need for us to call libc::strerror() manually. We can simply use the std::fmt::Display output, provided we configure libc using libc::setlocale(). It almost feels like an oversight on the part of Rust’s standard library not to have this functionality enabled by default on platforms that support it.
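A minimal sketch of this idea follows, assuming a Unix-like platform. To stay free of dependencies it declares the C setlocale() function directly instead of using the libc crate; the LC_ALL values are assumptions for Linux versus macOS and the BSDs (the libc crate exposes the correct constant as libc::LC_ALL), and init_locale_from_env is a hypothetical helper name.

```rust
use std::ffi::{c_char, c_int, CStr};
use std::io;

extern "C" {
    // Provided by the platform C library.
    fn setlocale(category: c_int, locale: *const c_char) -> *mut c_char;
}

// Assumed values; in real code prefer libc::LC_ALL from the libc crate.
#[cfg(target_os = "linux")]
const LC_ALL: c_int = 6;
#[cfg(not(target_os = "linux"))]
const LC_ALL: c_int = 0;

/// Ask the C library to pick up the locale from the environment
/// (LC_ALL, LC_MESSAGES, LANG, ...). Returns the locale name that
/// was selected, or None if the requested locale is not installed.
fn init_locale_from_env() -> Option<String> {
    // An empty string means "use the environment variables".
    let selected = unsafe { setlocale(LC_ALL, b"\0".as_ptr() as *const c_char) };
    if selected.is_null() {
        None
    } else {
        let name = unsafe { CStr::from_ptr(selected) };
        Some(name.to_string_lossy().into_owned())
    }
}

fn main() {
    eprintln!("selected locale: {:?}", init_locale_from_env());

    // errno 2 is ENOENT on Linux and macOS. The Display output comes
    // from the C library, so it follows the locale configured above.
    let err = io::Error::from_raw_os_error(2);
    println!("{}", err);
}
```

Calling setlocale() once at the very start of main(), before any threads are spawned, sidesteps most of the thread-safety concerns around the C locale functions.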

2c) --help messages

With the exception of m4, all the binaries in posixutils-rs currently use clap’s derive macro to generate help messages. In https://docs.rs/clap/latest/clap/_derive/index.html#command-attributes the about attribute accepts an expression, about [= <expr>], which presumably gets passed to https://docs.rs/clap/latest/clap/struct.Command.html#method.about. To use this with localized messages, the messages would need to have a 'static lifetime. To get around this we could use some kind of static thread-local or global cell containing a mutex, which can be used to load the appropriate locale at the start of main(), based on the current system settings, before executing clap.

Another issue is that using the about attribute disables the parsing of help messages from the doc attribute provided by Rust’s documentation comments. If we want these struct fields documented in the standard way for Rust, we will end up with duplicated text. We could probably get around this by creating a custom derive macro that wraps clap’s, uses the documentation comments for these fields in the localization system, and also provides the necessary about, which would satisfy the optional requirement 3.

There is also a question about the source of truth: if each application shares messages in a single registry as per requirement 5, then we may end up with duplicates that need to be detected. Perhaps it is better to refer to localizations only by a message id and keep them separate from code documentation comments; this makes the implementation a lot simpler too. Whatever we decide to do here should follow on from the more general decision about how to localize strings in the application for requirement 2a.
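The global-cell idea can be sketched with only the standard library. This version uses std::sync::OnceLock rather than a mutex, since the messages are written once and then only read. The message id "cat-about", the translations, and the helpers load_messages, init_messages and tr are hypothetical; a real implementation would read the strings from the message file.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Process-wide message table, filled once at the start of main().
static MESSAGES: OnceLock<HashMap<&'static str, String>> = OnceLock::new();

/// Hypothetical loader; stands in for reading the real message file.
fn load_messages(lang: &str) -> HashMap<&'static str, String> {
    let about = match lang {
        "fr" => "concaténer et afficher des fichiers",
        _ => "concatenate and print files",
    };
    HashMap::from([("cat-about", about.to_string())])
}

/// Load the table for `lang`. Later calls are ignored.
fn init_messages(lang: &str) {
    let _ = MESSAGES.set(load_messages(lang));
}

/// Look up a message by id. Because the table lives in a static,
/// the returned string has a 'static lifetime. Unknown ids (or a
/// table that was never initialised) fall back to the id itself.
fn tr(id: &'static str) -> &'static str {
    MESSAGES
        .get()
        .and_then(|table| table.get(id))
        .map(String::as_str)
        .unwrap_or(id)
}

fn main() {
    // Hypothetical locale detection, simplified to the LANG variable.
    let lang = std::env::var("LANG").unwrap_or_default();
    init_messages(if lang.starts_with("fr") { "fr" } else { "en" });

    // A &'static str like this is what an `about = <expr>` expression
    // would need to evaluate to.
    println!("{}", tr("cat-about"));
}
```

The trade-off is that the table can never be reloaded, which seems acceptable for short-lived command-line tools.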

Message Format

While an obvious choice for message format would be GNU gettext, some arguably better and more modern alternatives do exist. Fluent puts forward some good arguments for the places where its design differs from gettext: https://github.com/projectfluent/fluent/wiki/Fluent-vs-gettext In summary of this article:

Using a message identifier that is distinct from the source string makes updating the source language simpler (no invalidated translations, no reliance on fuzzy matching). It also allows different translations for messages that are identical in English but differ in other languages depending on context, without the burden being on the developer to recognise these situations, and it enables message re-use and composition via references.
Gettext’s support for grammatical rules is very limited.
String formatting and message arguments are an afterthought for Gettext.
Fluent uses a single data format, Gettext uses 3.
Rust’s compiler messages are translated using fluent https://rustc-dev-guide.rust-lang.org/diagnostics/translation.html, a significant endorsement.
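The first point in the list above can be illustrated with a small standard-library sketch. It does not use fluent or gettext; it only contrasts a table keyed by the English source string with a table keyed by message id. The ids and French strings are hypothetical.

```rust
use std::collections::HashMap;

/// Source-string keys: the English text is the lookup key, so every
/// use of "Open" shares one translation.
fn by_source_string() -> HashMap<&'static str, &'static str> {
    HashMap::from([("Open", "Ouvrir")])
}

/// Message-id keys: each use site has its own id, so the translator
/// can choose a different word per context.
fn by_message_id() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("action-open", "Ouvrir"), // verb: "open the file"
        ("state-open", "Ouvert"),  // adjective: "the file is open"
    ])
}

fn main() {
    let source_keyed = by_source_string();
    let id_keyed = by_message_id();

    // Both contexts get the same word with source-string keys.
    println!("{} / {}", source_keyed["Open"], source_keyed["Open"]);
    // With ids, each context gets its own translation.
    println!("{} / {}", id_keyed["action-open"], id_keyed["state-open"]);
}
```

It also shows why editing the English wording is cheaper with ids: the key stays the same, so existing translations remain attached to it.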

Further comparisons between systems and crate implementations:

There are two Rust crates for gettext: https://docs.rs/gettext/latest/gettext/ (a pure Rust implementation which claims to be a work in progress and hasn’t been updated in 5 years), and https://docs.rs/gettext-rs/latest/gettextrs/ (bindings to GNU gettext, bringing the associated downsides of using a C dependency in a Rust project).
i18n-embed + i18n-embed-fl provides some additional functionality on top of a basic fluent setup:
- Static checking of message keys and format arguments using a procedural macro. This is a big one, avoiding common runtime errors. There is another library which does this for fluent using codegen.
- A standardized layout for localization resources that enables building more static analysis tooling for cargo-i18n: https://github.com/kellpossible/cargo-i18n/issues/31 It’s actually possible to benefit from these without using i18n-embed, simply by using the i18n.toml config file with cargo-i18n.
- Some alternatives that provide similar functionality:
  https://github.com/zaytsev/fluent-static takes an alternative approach to i18n-embed-fl and uses codegen instead of a proc macro. This provides code completion and type signatures for messages as functions (there is an open issue to implement this for i18n-embed-fl: https://github.com/kellpossible/cargo-i18n/issues/73 ).
  https://github.com/MathieuTricoire/l10n
Burden on compilation:
- gettext adds 9 crate dependencies and an additional 0.05s to build time.
- gettext-rs adds 6 crate dependencies and an additional 144s to build time (if built statically), or 0.8s if using the gettext-system feature for dynamic linking.
- fluent adds 15 crate dependencies and an additional 0.7s to build time.
- i18n-embed + i18n-embed-fl adds 55 crate dependencies and an additional 4.5s to build time. I have some ideas for how to bring this down considerably (https://github.com/kellpossible/cargo-i18n/issues/131).
An important consideration for which message format to use is its support in localization tooling. Because posixutils-rs is by its nature a technical tool, it’s not unreasonable to expect that translators working on the project would be comfortable working with plain text files and a git repository in order to make contributions.
https://github.com/baptiste0928/rosetta is another alternative, based on a JSON message serialization format and custom string formatting, with code generation for static type checking.
A custom message format could be constructed using a serialization format like TOML, with message formatting/arguments using something like https://github.com/dtolnay/basic-toml and https://lib.rs/crates/minijinja if serde were an acceptable dependency. However, none of our tools currently depend on serde. Considering that strings are sprinkled throughout the code, and that the serde derive macro has a reputation for increasing compile times if used extensively, we probably want a solution that doesn’t rely on it.

Proposal

This proposal is that we definitely use the fluent localization system instead of gettext. For a minimal setup it could potentially even have a lower overhead, and it has none of the licensing concerns of LGPL gettext for systems that must build it statically. It seems like an obvious choice after considering the tradeoffs. If localization is to be taken seriously, the features that fluent provides over simpler ad-hoc systems with basic message formatting are very important.

The next decision is what to use for the scaffolding around fluent. Messages must be loaded from disk, bundles must be configured according to the user’s requested locale, and ideally some form of static checking should be employed to help prevent mundane runtime errors. If keeping dependencies to a bare minimum is a high priority, then we could gradually implement this ourselves from scratch. If, however, there is a desire to share this functionality with the Rust community at large, then I’d propose to use i18n-embed-fl and cargo-i18n and upstream any changes required to make them fit the requirements of this project. I’m the maintainer of those projects, so I’d be very happy to take on this responsibility if that’s the direction we decide to go with.

rustcoreutils / posixutils-rs

Localization (i18n): notes and planning #72
