servo / unicode-bidi

Implementation of the Unicode Bidirection Algorithm in Rust
Other
77 stars 33 forks source link

Conformance Rust tests added #20

Closed samruds1 closed 7 years ago

samruds1 commented 8 years ago

Files added: src/BidiCharacterTest.txt src/bracket_pairs.rs src/BidiTest.txt src/UnicodeData.txt

tools/bidiCharacterTestParser.py tools/bidiTest.py tools/bidiCharacterTestParser.pyc
tools/bidiTest.pyc

Preliminary pull request. Need help with integration of new code with existing code. Verification needed for Rust conformance tests added.

Review on Reviewable

samruds1 commented 8 years ago

Here are some of the questions we have so far:

1) Rust code for step N0 has been added to bracket_pairs.rs, need help with integration into lib.rs. 2) Code for W1-W7 has been added to lib.rs. Please verify if correct. 3) Need clarifications for the Rust test cases inserted in src/lib.rs: a) We are not able to execute cargo test possibly due to large number of test cases. b) For test cases added from BidiTest.txt we have removed characters where level equals 'x'. Is this correct. 4) L1 step has not been implemented yet.

samruds1 commented 8 years ago

@eefriedman @mbrubeck - Can you please review.

vicky-katara commented 8 years ago

@jdm, during the execution of the test case automation suite, these strange files get created at some step. We were not able to decipher where or why they get created. However, I have deleted them and have added the entry to the ".gitignore" file to make sure that they never get pushed again. I had added this entry to .gitignore in the commit before the one which had the issue, however I made the commit while inside the /src/ directory, hence ignoring the .gitignore file.

I hope everything looks good now.

vicky-katara commented 8 years ago

@mbrubeck , @eefriedman, @jdm while we have pushed out code in this commit, we are yet to integrate the code for our steps L1 and N0 into lib.rs. Here are some of the points that I need help on, to proceed with this integration:

jdm commented 8 years ago

For the record, @mbrubeck is unfortunately on vacation today and this weekend :/

eefriedman commented 8 years ago

(a) Where will I find the indexes[] array containing character codes for the characters in the input string?

str has a few related methods for extracting characters; see https://doc.rust-lang.org/nightly/std/primitive.str.html#method.chars and https://doc.rust-lang.org/nightly/std/primitive.str.html#method.char_indices . For performance reasons, it's usually best to stick with the &str representation of a string rather than converting it to a &[char].

(a) Should the code for the new steps be part of lib.rs or can they appear in a different file and be imported into lib.rs?

Separate the code into different files however you feel is best. Putting the code for a complicated rule with a single entry point into a separate file seems like a good idea.

(b) Should we keep the rust modules for the two steps L1 and N0 in different rust source code files or the in the same one,(in the case we that we are not integrating them into lib.rs)?

Separating them seems fine.

(c) How do I deliver output codes back for the next step to proceed. For this I need to pass the 'codes' array which holds the output codes for the brackets.

At a high level, step N0 is similar to resolve_weak: you're taking the array of BIDI classes, and resolving some of them, so the entry point probably looks similar.

(b) Where will I find the 'sos' and 'level' variables holding the start of sequence and the embedding direction level for the given run?

See IsolatingRunSequence.

vicky-katara commented 8 years ago

Thanks, @eefriedman. We'll get the changes in as soon as possible!

samruds1 commented 8 years ago

@eefriedman I have some doubts too. Could you please answer them.

1) For the test cases related to BidiTest.txt:

We have added test cases for the method process_text(). The test cases are of the same format as the existing test cases:

assert_eq!(process_text("abc123", Some(0)), BidiInfo { levels: vec![0, 0, 0, 0, 0, 0], classes: vec![L, L, L, EN, EN, EN], paragraphs: vec![ParagraphInfo { range: 0..6, level: 0 }], });

a) Is the following test case correct?

assert_eq!(process_text("\u{FA5E}\u{1F102}\u{FF0B}\u{20B0}\u{FE55}\u{E017E}\u{001C}\u{001F}\u{2000}\u{1015A}\u{2066}\u{2067}\u{2068}\u{2069}", Some(0)), BidiInfo { levels: vec![ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], classes: vec![L, EN, ES, ET, CS, NSM, B, S, WS, ON, LRI, RLI, FSI, PDI], paragraphs: vec![ParagraphInfo { range: 0..14, level: 0 } ], });

b) As per my understanding the value in the Some field in the given test should be the values of the bitset (para levels). Is that correct?

c) While inserting the test cases in lib.rs can we remove Bidi Classes which correspond to levels with the value x.

for example, can we ignore this test case?

@Levels: x @Reorder:
LRE; 7 LRO; 7 RLE; 7

or in the following example, can we remove LRE, LRO, RLE from these test cases?

@Levels: 0 x @Reorder: 0 L LRE; 3 L LRO; 3 L RLE; 3

2) For the implementation of step L1 can a separate function be created (resolve_white_space) instead of inserting the code in the visual_runs method?

vicky-katara commented 8 years ago

@eefriedman, @mbrubeck

I understand that my function which will perform the Step N0 will have to be called in the resolve_neutral function. I want to know how to get the input array. Does an IsolatingRunSequence contain the input string (in a &txt or &[char] format)? If yes, how do I extract it?

eefriedman commented 8 years ago

I understand that my function which will perform the Step N0 will have to be called in the resolve_neutral function. I want to know how to get the input array. Does an IsolatingRunSequence contain the input string (in a &txt or &[char] format)? If yes, how do I extract it?

No, IsolatingRunSequence does not contain a string; you'll have to pass it in separately.

vicky-katara commented 8 years ago

@eefriedman, @mbrubeck

I am getting a failed test case, and I am not able to identify what is wrong. Could you have a look?

Assert Left: "])[(\u{2680}a\u{5d1}(\u{5d0}" Assert Right: "\u{5d1}(\u{5d0}a\u{2680}([)]"

eefriedman commented 8 years ago

It's a little hard to figure out what exactly is going wrong from just the output strings... but are you sure you're implementing BD16 correctly?

vicky-katara commented 8 years ago

Yes, I just debugged BD16. It recognizes only one pair of brackets for the above string.

eefriedman commented 8 years ago

What's the input string for your testcase?

vicky-katara commented 8 years ago

@eefriedman @mbrubeck @jdm

We had some confusion about how to proceed. At this point, we can control what test cases are being inserted from BidiTest.txt and BidiCharacterTest.txt. If we put in all the test cases, the build just does not succeed. And it is getting difficult to identify why.

Should we insert a few passing test cases which test the steps that we have added, so that the pull request can get merged? We could mention the test cases which do not pass.

vicky-katara commented 8 years ago

@eefriedman here is the test case from BidiCharacterTest.txt:

05D0 0028 05D1 0061 2680 0028 005B 0029 005D;0;0;1 1 1 0 0 0 0 0 0;2 1 0 3 4 5 6 7 8

eefriedman commented 8 years ago

Are you sure you're converting the testcase correctly? You can end up with either your left or your right result depending on the default paragraph embedding level (an LTR context versus an RTL context).

vicky-katara commented 8 years ago

I am taking the index values given in the test case to form the result.

Here is the description of a test case: -# Field 0: A sequence of hexadecimal code point values separated by space -# Field 1: A value representing the paragraph direction, as follows: -# 0 represents left-to-right -# 1 represents right-to-left -# 2 represents auto-LTR according to rules P2 and P3 of the algorithm -# Field 2: The resolved paragraph embedding level -# Field 3: A list of resolved levels; characters removed in rule X9 are -# indicated with an 'x' -# Field 4: A list of indices showing the resulting visual ordering from -# left to right; characters with a resolved level of 'x' are skipped

I am using Field 0 as input, and taking indexing into Field 0 using Field 4 as output indices. Do you think there is something wrong with this approach? The LTR/RTL context should be taken care of by the output, right?

eefriedman commented 8 years ago

The "0" in Field 1 represents left-to-right, which means that you should apply HL1 and explicitly set the paragraph embedding level to zero.

vicky-katara commented 8 years ago

Does this mean that I take the output produced by using the indexes of Field 0, and reverse the string?

eefriedman commented 8 years ago

process_text has two inputs; the first is the string, the second is the paragraph embedding level.

vicky-katara commented 8 years ago

@eefriedman, I noticed that as well. Thanks for the help. I'll try with this and get back to you!

vicky-katara commented 8 years ago

I am now getting this error:

thread 'test::test_reorder_line' panicked at 'index 2 and/or 3 in אב(גד[&ef].)gh do not lie on character boundary', ../src/libcore/str/mod.rs:1284

This is the test case: assert_eq!(reorder_with_para_level("\u{05D0}\u{05D1}\u{0028}\u{05D2}\u{05D3}\u{005B}\u{0026}\u{0065}\u{0066}\u{005D}\u{002E}\u{0029}\u{0067}\u{0068}", Some(0)),"\u{05D1}\u{05D0}\u{0028}\u{05D3}\u{05D2}\u{005B}\u{0026}\u{0065}\u{0066}\u{005D}\u{002E}\u{0029}\u{0067}\u{0068}");//BidiCharacterTest.txt Line Number:40

Here is the input:

05D0 05D1 0028 05D2 05D3 005B 0026 0065 0066 005D 002E 0029 0067 0068;0;0;1 1 0 1 1 0 0 0 0 0 0 0 0 0;1 0 2 4 3 5 6 7 8 9 10 11 12 13

Do you know why this fails?

eefriedman commented 8 years ago

A "do not lie on character boundary" panic generally means you're using string indexing incorrectly; e.g. let x = &"⚀"[1..];. (See also https://doc.rust-lang.org/nightly/book/strings.html#slicing .) Try getting a backtrace to figure out where it's happening.

vicky-katara commented 8 years ago

I am not able to make sense of this backtrace. Is there any indicator of the point of failure here?

thread 'test::test_reorder_line' panicked at 'index 2 and/or 3 in אב(גד[&ef].)gh do not lie on character boundary', ../src/libcore/str/mod.rs:1284

stack backtrace: 1: 0x564cc1e817fe - sys::backtrace::write::haf6e4e635ac76143Ivs 2: 0x564cc1e84f25 - panicking::on_panic::ha085a58a08f78856lzx 3: 0x564cc1e7885e - rt::unwind::begin_unwind_inner::hc90ee27246f12475C0w 4: 0x564cc1e791f6 - rt::unwind::begin_unwind_fmt::ha4be06289e0df3dbIZw 5: 0x564cc1e84b86 - rust_begin_unwind 6: 0x564cc1eb3aa4 - panicking::panic_fmt::he7875691f9cbe589SgC 7: 0x564cc1eb624f - str::slice_error_fail::hbe442ed6cb13a72dsqK 8: 0x564cc1e0688a - str::traits::str.ops..Index<ops..Range>::index::h909793bab573faecccK at ../src/libcore/str/mod.rs:1104 9: 0x564cc1e0e9c0 - reorder_line::h180dad6937165a51e6b at /home/servo/unicode-bidi/src/lib.rs:176 10: 0x564cc1e32aae - test::test_reorder_line::reorder_with_para_level::ha84ac2c49163f9500Id at /home/servo/unicode-bidi/src/lib.rs:1237 11: 0x564cc1e31e66 - test::test_reorder_line::h00b48c9c08476d87iId at /home/servo/unicode-bidi/tools/:1262 12: 0x564cc1e5611b - boxed::F.FnBox::call_box::h13096251856404813500 13: 0x564cc1e58de1 - boxed::F.FnBox::call_box::h17352075482666389094 14: 0x564cc1e568c9 - rt::unwind::try::try_fn::h2399095994474881594 15: 0x564cc1e84add - __rust_try 16: 0x564cc1e7fe37 - rt::unwind::try::inner_try::h59523aa853a0e10avWw 17: 0x564cc1e56ae8 - boxed::F.FnBox::call_box::h389096937566750418 18: 0x564cc1e83c81 - sys::thread::Thread::new::thread_start::h890d1188505773c835v 19: 0x7fe317fc7554 - start_thread 20: 0x7fe3178e3b9c - __clone 21: 0x0 -

failures: test::test_reorder_line

test result: FAILED. 11 passed; 1 failed; 0 ignored; 0 measured

thread '

' panicked at 'Some tests failed', ../src/libtest/lib.rs:252 stack backtrace: 1: 0x564cc1e817fe - sys::backtrace::write::haf6e4e635ac76143Ivs 2: 0x564cc1e851b6 - panicking::on_panic::ha085a58a08f78856lzx 3: 0x564cc1e7885e - rt::unwind::begin_unwind_inner::hc90ee27246f12475C0w 4: 0x564cc1e35461 - rt::unwind::begin_unwind::h218749302018745608 5: 0x564cc1e36e1c - test_main::h6a499577710d847aI1a 6: 0x564cc1e3d6d2 - test_main_static::hb27df092379fd7dbd4a 7: 0x564cc1e3404c - test::main::hfd578a788dc6470cDee 8: 0x564cc1e84add - rust_try 9: 0x564cc1e86e7a - rt::lang_start::hefba4015e797c325hux 10: 0x564cc1e341cb - main 11: 0x7fe3178016ff - __libc_start_main 12: 0x564cc1dfb5f8 - _start 13: 0x0 -

vicky-katara commented 8 years ago

@eefriedman, I have commented out the code that we had inserted. It seems that this error arises even with existing code. This tells me that the test case itself is going wrong somewhere. Are you sure this is the correct test case?

BidiCharacterTest line 40

05D0 05D1 0028 05D2 05D3 005B 0026 0065 0066 005D 002E 0029 0067 0068;0;0;1 1 0 1 1 0 0 0 0 0 0 0 0 0;1 0 2 4 3 5 6 7 8 9 10 11 12 13

assert_eq!(reorder_with_para_level("\u{05D0}\u{05D1}\u{0028}\u{05D2}\u{05D3}\u{005B}\u{0026}\u{0065}\u{0066}\u{005D}\u{002E}\u{0029}\u{0067}\u{0068}", Some(0)),"\u{05D1}\u{05D0}\u{0028}\u{05D3}\u{05D2}\u{005B}\u{0026}\u{0065}\u{0066}\u{005D}\u{002E}\u{0029}\u{0067}\u{0068}");//BidiCharacterTest.txt Line Number:40

I have made the following function to test reorder line while specifying the paragraph level input:

fn reorder_with_para_level(s: &str, level: Option) -> Cow { let info = process_text(s, level); let para = &info.paragraphs[0]; reorder_line(s, para.range.clone(), &info.levels) }

eefriedman commented 8 years ago

It looks like it's crashing at https://github.com/servo/unicode-bidi/blob/2a99bcaadd34bda9f8ab035bbdc3d170c5d78d71/src/lib.rs#L169 ... and it looks like that range comes from https://github.com/servo/unicode-bidi/blob/2a99bcaadd34bda9f8ab035bbdc3d170c5d78d71/src/lib.rs#L204 . I'm guessing we're somehow violating the constraint that every byte of a character has the same level?

Anyway, I don't think your test should be triggering this; I'm guessing it's a bug in the code.

vicky-katara commented 8 years ago

Thanks for looking into it.

How should we go about this now?

We've implemented all the steps and added code to pull test cases from BidiTest.txt and BidiCharacterTest.txt

mbrubeck commented 8 years ago

@vicky-katara For now, you can temporarily comment out or delete the failing test(s), or add code to your Python script to do this (maybe using a list of line numbers that are known to fail). File a new issue in GitHub about the failure so that we will come back and fix it later.

vicky-katara commented 8 years ago

@mbrubeck, there are more than a 100,000 test cases in total, from BidiCharacterTest.txt and BidiTest.txt. I don't think finding a list of failing test cases is viable. Will it be fine if we include a list of passing test cases, with comments describing how to add all test cases back to lib.rs (for example by uncommenting code which adds test cases)?

mbrubeck commented 8 years ago

Will it be fine if we include a list of passing test cases, with comments describing how to add all test cases back to lib.rs (for example by uncommenting code which adds test cases)?

Yes, that sounds good.

vicky-katara commented 8 years ago

@eefriedman @mbrubeck

We're getting failed test cases for test cases like this from BidiTest.txt:

@ Levels: x 4 @ Reorder: 1 LRE AN; 7 RLE L; 4 RLE EN; 4 RLE AN; 4

Count: 4

We feel we built the test case correctly. Could you help us convert this test case into a rust assert test case of this form:

This is what we have:

assert_eq!(process_text("\u{0669}\u{11F2}\u{1D7D6}\u{10E74}", Some(1)), BidiInfo { levels: vec![ 7, 4, 4, 4], classes: vec![AN, L, EN, AN], paragraphs: vec![ParagraphInfo { range: 0..4, level: 4 } ], });

vicky-katara commented 8 years ago

@mbrubeck @eefriedman @jdm

Here is what we have accomplished:

Additionally:

*Identified defect with process_text [https://github.com/servo/unicode-bidi/issues/22].

mbrubeck commented 8 years ago

Great work, thanks! I haven't had time to review all of this code yet, but I look forward to getting it merged.

mbrubeck commented 8 years ago

Sorry I haven't reviewed this PR yet; I'm just back from vacation and will be working on it this week. Since there's a lot of code here split across a few dozen commits, I think I'll try to split it into a few separate PRs that can be reviewed landed separately. If I find code that needs to be changed, I'm happy to apply the changes myself (and have another Servo developer review my changes as needed). Or if any of you have the time and inclination to keep working on this code, that would be great too—just let me know.

bors-servo commented 7 years ago

:umbrella: The latest upstream changes (presumably #25) made this pull request unmergeable. Please resolve the merge conflicts.

behnam commented 7 years ago

I think we can close this PR now that we have another approach for having conformance tests (as integration tests) in https://github.com/servo/unicode-bidi/pull/30 . There's still more work needed (including adding the char-dependent tests rules), but they should be done in a separate PR, IMHO.