uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License

`expr` is failing with multibyte chars #3132

Open sylvestre opened 2 years ago

sylvestre commented 2 years ago

This causes the GNU test https://github.com/coreutils/coreutils/blob/master/tests/misc/expr-multibyte.pl to fail:

$ ./target/debug/coreutils expr length αbcdef
7

GNU:

$ expr length αbcdef
6

Reproducing this requires an additional UTF-8 locale to be generated, for example:

sudo locale-gen fr_FR.UTF-8

sylvestre commented 2 years ago

Of course, this comes from how Rust represents strings: `len()` returns the number of bytes, not the number of characters. See https://doc.rust-lang.org/book/ch08-02-strings.html#internal-representation

Simple testcase:

fn main() {
    let s = String::from("αbcdef");
    assert_eq!(s.len(), 6);
}

=>

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `6`', src/main.rs:3:5
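
For comparison, a minimal sketch of the two counts side by side. The helper names `byte_len` and `char_len` are hypothetical and only used for illustration; they wrap the standard `str::len` and `str::chars` calls:

```rust
/// Length of the UTF-8 encoding in bytes (what `str::len` reports).
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (`char`s) in the string.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let s = "αbcdef";
    // 'α' (U+03B1) takes two bytes in UTF-8, the ASCII letters one each.
    assert_eq!(byte_len(s), 7);
    assert_eq!(char_len(s), 6);
    println!("bytes = {}, chars = {}", byte_len(s), char_len(s));
}
```

With `chars().count()` the result matches the 6 that GNU prints in a UTF-8 locale.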

tertsdiepraam commented 2 years ago

I did some extra testing to check whether we need Unicode grapheme segmentation here, and we don't. GNU expr outputs a length of 2 for this emoji, which matches `chars().count()`:

[src/main.rs:4] "🇳🇱".len() = 8
[src/main.rs:5] "🇳🇱".chars().count() = 2
[src/main.rs:6] UnicodeSegmentation::graphemes("🇳🇱", true).count() = 1

Playground link

sylvestre commented 2 years ago

Yeah, I am working on a fix :)

sylvestre commented 2 years ago

To reproduce: bash util/run-gnu-test.sh tests/misc/expr-multibyte

sylvestre commented 2 years ago

Actually, my patch was wrong: it should take the locale into account.

$ LANG=C expr length αbcdef
7
$ LANG=fr_FR.UTF-8 expr length αbcdef
6

It seems that we should use MB_CUR_MAX to find out how many bytes a character can take in the current locale.
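
A rough sketch of what a locale-aware length could look like, not the actual fix. It does not call MB_CUR_MAX; as a stand-in it assumes the locale can be guessed from the LC_ALL, LC_CTYPE and LANG environment variables (in that order) and that a name ending in `.UTF-8` or `.utf8` means a UTF-8 locale. All function names here are hypothetical:

```rust
use std::env;

/// Hypothetical helper: does this locale name look like a UTF-8 locale?
/// Assumption: the codeset suffix is ".UTF-8" or ".utf8", optionally
/// followed by an "@modifier".
fn is_utf8_locale_name(name: &str) -> bool {
    let base = name.split('@').next().unwrap_or("");
    let lower = base.to_ascii_lowercase();
    lower.ends_with(".utf-8") || lower.ends_with(".utf8")
}

/// Hypothetical helper: pick the locale name that governs character
/// handling, using the usual precedence LC_ALL > LC_CTYPE > LANG.
/// `lookup` is injected so the function can be exercised without
/// touching the real environment.
fn effective_ctype_locale(lookup: impl Fn(&str) -> Option<String>) -> String {
    for var in ["LC_ALL", "LC_CTYPE", "LANG"] {
        if let Some(value) = lookup(var) {
            if !value.is_empty() {
                return value;
            }
        }
    }
    "C".to_string()
}

/// Count characters in a UTF-8 locale, bytes otherwise.
fn expr_length(s: &str, locale: &str) -> usize {
    if is_utf8_locale_name(locale) {
        s.chars().count()
    } else {
        s.len()
    }
}

fn main() {
    let locale = effective_ctype_locale(|key| env::var(key).ok());
    println!("{}", expr_length("αbcdef", &locale));
}
```

Under these assumptions, `expr_length("αbcdef", "C")` gives 7 and `expr_length("αbcdef", "fr_FR.UTF-8")` gives 6, mirroring the two GNU runs above. Non-UTF-8 multibyte encodings and arguments that are not valid UTF-8 are not handled here.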