servo / unicode-bidi

Implementation of the Unicode Bidirection Algorithm in Rust
Other
74 stars 34 forks source link

Results don't match the UBA reference implementation #119

Closed jfkthame closed 7 months ago

jfkthame commented 7 months ago

A combination of isolate embedding plus overrides seems to be mishandled.

Consider the following (in BidiCharacterTest.txt format):

05d0 002b 2066 202d 0065 0066 202c 2069 002d 05d1;0;0;1 1 1 x 4 4 x 1 1 1;9 8 7 4 5 2 1 0

05d0 002b 2066 202e 0065 0066 202c 2069 002d 05d1;0;0;1 1 1 x 3 3 x 1 1 1;9 8 7 5 4 2 1 0

These both PASS when run with the current (Unicode 15.1) version of bidiref. However, if I run them with unicode-bidi, both examples FAIL:

2 test cases failed! (0 passed) {

0: Fail { line_num: 4, input_base_level: Some(Level(0)), input_classes: [], input_string: "א+\u{2066}\u{202d}ef\u{202c}\u{2069}-ב", exp_base_level: Some(Level(0)), exp_levels: ["1", "1", "1", "x", "4", "4", "x", "1", "1", "1"], exp_ordering: ["9", "8", "7", "4", "5", "2", "1", "0"], actual_base_level: None, actual_levels: [Level(1), Level(0), Level(0), Level(0), Level(4), Level(4), Level(4), Level(0), Level(0), Level(1)], actual_ordering: ["0", "1", "2", "4", "5", "7", "8", "9"], actual_unfiltered_ordering: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] }

...

1: Fail { line_num: 6, input_base_level: Some(Level(0)), input_classes: [], input_string: "א+\u{2066}\u{202e}ef\u{202c}\u{2069}-ב", exp_base_level: Some(Level(0)), exp_levels: ["1", "1", "1", "x", "3", "3", "x", "1", "1", "1"], exp_ordering: ["9", "8", "7", "5", "4", "2", "1", "0"], actual_base_level: None, actual_levels: [Level(1), Level(0), Level(0), Level(0), Level(3), Level(3), Level(3), Level(1), Level(1), Level(1)], actual_ordering: ["0", "1", "2", "9", "8", "7", "5", "4"], actual_unfiltered_ordering: [0, 1, 2, 3, 9, 8, 7, 6, 5, 4] }

}

I think this is an actual bug, though I admit I'm getting a bit lost among all the rules....

Manishearth commented 7 months ago

https://util.unicode.org/UnicodeJsps/bidic.jsp is useful for running bidiref, fwiw. Though it's still on Unicode 14.

Manishearth commented 7 months ago

https://util.unicode.org/UnicodeJsps/bidic.jsp?s=%D7%90%2B%E2%81%A6%E2%80%ADef%E2%80%AC%E2%81%A9-%D7%91&b=0&u=140&d=2

Manishearth commented 7 months ago

Seems to be a problem in W6, the minus sign should get turned into ON

Manishearth commented 7 months ago

No, that's not it, the isolating run sequences are incorrect. The first and last third of the string should be in a single IRS

Manishearth commented 7 months ago

Traced back through BD13 which is producing incorrect IRSes because the LRO is in the first level run (it should not be)

jfkthame commented 7 months ago

Thanks for the analysis & fix!