unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Implement Greek uppercasing behavior of ignoring accents #3552

Closed Manishearth closed 1 year ago

Manishearth commented 1 year ago

Part of https://github.com/unicode-org/icu4x/issues/3234

See https://unicode-org.atlassian.net/browse/ICU-5456. ICU4X does not implement this.

The special case is implemented in GreekUpper in ICU4C and ICU4J. Seems somewhat involved.
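
For readers who have not seen the rule before, here is a minimal sketch of the behavior in plain Rust, using only the standard library. It is not the ICU4X API and not the ICU4C/ICU4J algorithm: `strip_tonos` and `greek_upper_sketch` are hypothetical names, and the sketch assumes precomposed monotonic input only (no polytonic letters, no combining marks). It shows three parts of the rule: the tonos is dropped when uppercasing, a dialytika is added to an ι or υ that follows an accented vowel, and a standalone "ή" keeps its accent.

```rust
/// Returns the unaccented uppercase base for a precomposed Greek vowel
/// carrying a tonos, or `None` for any other character.
fn strip_tonos(c: char) -> Option<char> {
    Some(match c {
        '\u{03AC}' | '\u{0386}' => '\u{0391}', // ά, Ά -> Α
        '\u{03AD}' | '\u{0388}' => '\u{0395}', // έ, Έ -> Ε
        '\u{03AE}' | '\u{0389}' => '\u{0397}', // ή, Ή -> Η
        '\u{03AF}' | '\u{038A}' => '\u{0399}', // ί, Ί -> Ι
        '\u{03CC}' | '\u{038C}' => '\u{039F}', // ό, Ό -> Ο
        '\u{03CD}' | '\u{038E}' => '\u{03A5}', // ύ, Ύ -> Υ
        '\u{03CE}' | '\u{038F}' => '\u{03A9}', // ώ, Ώ -> Ω
        '\u{0390}' => '\u{03AA}',              // ΐ -> Ϊ (dialytika is kept)
        '\u{03B0}' => '\u{03AB}',              // ΰ -> Ϋ (dialytika is kept)
        _ => return None,
    })
}

/// Simplified Greek uppercasing (assumption: precomposed monotonic text).
fn greek_upper_sketch(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    // True when the previous character was a vowel whose tonos was removed.
    let mut after_accented_vowel = false;

    for (i, &c) in chars.iter().enumerate() {
        if let Some(base) = strip_tonos(c) {
            let prev_is_letter = i > 0 && chars[i - 1].is_alphabetic();
            let next_is_letter = chars.get(i + 1).map_or(false, |n| n.is_alphabetic());
            if base == '\u{0397}' && !prev_is_letter && !next_is_letter {
                // A standalone accented eta (the word "or") keeps its tonos.
                out.push('\u{0389}'); // Ή
            } else {
                out.push(base);
            }
            after_accented_vowel = true;
        } else if after_accented_vowel && matches!(c, '\u{03B9}' | '\u{0399}') {
            // ι after an accented vowel: add a dialytika to mark the hiatus.
            out.push('\u{03AA}'); // Ϊ
            after_accented_vowel = false;
        } else if after_accented_vowel && matches!(c, '\u{03C5}' | '\u{03A5}') {
            // υ after an accented vowel: add a dialytika to mark the hiatus.
            out.push('\u{03AB}'); // Ϋ
            after_accented_vowel = false;
        } else {
            // Everything else uses the default Unicode uppercase mapping.
            out.extend(c.to_uppercase());
            after_accented_vowel = false;
        }
    }
    out
}
```

With this sketch, "Πατάτα" becomes "ΠΑΤΑΤΑ", "Μάιος" becomes "ΜΑΪΟΣ", and "ή." becomes "Ή.". The real implementations also handle polytonic and decomposed input, which is much of what makes them involved.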

Manishearth commented 1 year ago

Currently, ICU4C's TestGreekUpper() test fails when ported to ICU4X because of this. Here's the converted Rust version in case anyone needs it later.

/// ICU4C's TestGreekUpper
#[test]
fn test_greek_upper() {
    let cm = CaseMapping::new_with_locale(&locale!("el"));

    // https://unicode-org.atlassian.net/browse/ICU-5456
    assert_eq!(cm.to_full_uppercase_string("άδικος, κείμενο, ίριδα"), "ΑΔΙΚΟΣ, ΚΕΙΜΕΝΟ, ΙΡΙΔΑ");
    // https://bugzilla.mozilla.org/show_bug.cgi?id=307039
    // https://bug307039.bmoattachments.org/attachment.cgi?id=194893
    assert_eq!(cm.to_full_uppercase_string("Πατάτα"), "ΠΑΤΑΤΑ");
    assert_eq!(cm.to_full_uppercase_string("Αέρας, Μυστήριο, Ωραίο"), "ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ, ΩΡΑΙΟ");
    assert_eq!(cm.to_full_uppercase_string("Μαΐου, Πόρος, Ρύθμιση"), "ΜΑΪΟΥ, ΠΟΡΟΣ, ΡΥΘΜΙΣΗ");
    assert_eq!(cm.to_full_uppercase_string("ΰ, Τηρώ, Μάιος"), "Ϋ, ΤΗΡΩ, ΜΑΪΟΣ");
    assert_eq!(cm.to_full_uppercase_string("άυλος"), "ΑΫΛΟΣ");
    assert_eq!(cm.to_full_uppercase_string("ΑΫΛΟΣ"), "ΑΫΛΟΣ");
    assert_eq!(cm.to_full_uppercase_string("Άκλιτα ρήματα ή άκλιτες μετοχές"), "ΑΚΛΙΤΑ ΡΗΜΑΤΑ Ή ΑΚΛΙΤΕΣ ΜΕΤΟΧΕΣ");
    // http://www.unicode.org/udhr/d/udhr_ell_monotonic.html
    assert_eq!(cm.to_full_uppercase_string("Επειδή η αναγνώριση της αξιοπρέπειας"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ ΤΗΣ ΑΞΙΟΠΡΕΠΕΙΑΣ");
    assert_eq!(cm.to_full_uppercase_string("νομικού ή διεθνούς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ");
    // http://unicode.org/udhr/d/udhr_ell_polytonic.html
    assert_eq!(cm.to_full_uppercase_string("Ἐπειδὴ ἡ ἀναγνώριση"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ");
    assert_eq!(cm.to_full_uppercase_string("νομικοῦ ἢ διεθνοῦς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ");
    // From Google bug report
    assert_eq!(cm.to_full_uppercase_string("Νέο, Δημιουργία"), "ΝΕΟ, ΔΗΜΙΟΥΡΓΙΑ");
    // http://crbug.com/234797
    assert_eq!(cm.to_full_uppercase_string("Ελάτε να φάτε τα καλύτερα παϊδάκια!"), "ΕΛΑΤΕ ΝΑ ΦΑΤΕ ΤΑ ΚΑΛΥΤΕΡΑ ΠΑΪΔΑΚΙΑ!");
    assert_eq!(cm.to_full_uppercase_string("Μαΐου, τρόλεϊ"), "ΜΑΪΟΥ, ΤΡΟΛΕΪ");
    assert_eq!(cm.to_full_uppercase_string("Το ένα ή το άλλο."), "ΤΟ ΕΝΑ Ή ΤΟ ΑΛΛΟ.");
    // http://multilingualtypesetting.co.uk/blog/greek-typesetting-tips/
    assert_eq!(cm.to_full_uppercase_string("ρωμέικα"), "ΡΩΜΕΪΚΑ");
    assert_eq!(cm.to_full_uppercase_string("ή."), "Ή.");
}

sffc commented 1 year ago

This seems like an i18n quality bug, and we don't want to advertise a component as stabilized with known i18n quality bugs. Having none is one of the checkboxes for stabilizing any component (along with FFI and docs). We can slip on feature coverage but not on correctness.

Manishearth commented 1 year ago

Would someone be interested in trying to implement this? There's prototype code in https://icu.unicode.org/design/case/greek-upper, and ICU4C/ICU4J both have working implementations.

I am unlikely to have time to get to this this week.