json-schema-org / JSON-Schema-Test-Suite

A language agnostic test suite for the JSON Schema specifications
MIT License
625 stars 209 forks source link

Tests for mixed Arabic-Indic digits and Extended Arabic-Indic digits violates the Bidi rule #737

Closed OptimumCode closed 6 months ago

OptimumCode commented 7 months ago

The current test cases (like this one) violate the Bidi rule and does not test what they are supposed to test.

The Bidi rule is violated because:

By adding the first character with type AL the string will pass the Bidi rule validation but still will fail the Arabic-Indic digits rule.

Please, correct me if I missed something here and my conclusions are wrong.

OptimumCode commented 7 months ago

Hi, @Julian @gregsdennis Could you please take a look at the issue and share your thoughts about it? Or maybe you could add somebody else from contributors who could review the issue. Thank you!

gregsdennis commented 7 months ago

I'm not sure what the issue is here. Is there a technical reason this character needs to be added (i.e. is this testing something that's not currently covered?), or is it merely a linguistic problem?

Julian commented 7 months ago

@OptimumCode I think but haven't re-checked that this is the same as #675 -- can you perhaps have a look there, I recall having a look into this and concluding it seemed correct as is to me (and OP on that issue didn't follow up). But yeah perhaps let me know if you see something wrong there and/or whether you agree you're referring to the same thing.

OptimumCode commented 7 months ago

Thank you, @Julian. I did not see this issue when tried to look for an existing one. Yes, this issue seems to be similar to this one except the author was wrong about KATAKANA MIDDLE DOT cases.

I am not an expert here either so I might be wrong somewhere. But let me try to explain why I think this change should be made.

Answering @gregsdennis question: This change should be made for a technical reason. Right now this string is an invalid idn-hostname for two reasons:

These are 3 tests for Arabic-Indic digits and Extended Arabic-Indic digits in the JSON schema test suite. One test has them mixed and two tests have only either Arabic-Indic digits or Extended Arabic-Indic digits. Let's now imagine an implementation that does not check for mixed Arabic-Indic digits and Extended Arabic-Indic digits at all. Two later tests will pass because nothing is wrong with them. The first test will fail because it violates the Bidi rule. As a result, all tests for idn-hostname are green.

Now the hardest part - why does it violate the Bidi rule?

According to paragraph 4.2.3.4 in RFC5891 the labels that have right-to-left characters MUST meet Bidi criteria

If the proposed label contains any characters from scripts that are written from right to left, it MUST meet the Bidi criteria [RFC5893]

What is right-to-left is explained in RFC5893 which is referenced above. In section 1.4 we can find definition or right-to-left:

An RTL label is a label that contains at least one character of type R, AL, or AN.

In our case, it contains characters with the types AN (Arabic Number) and UN (European Number). This would mean that it is an RTL label.

Also, the header for RFC5893 is Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA). Because of that, I believe the Bidi domain name and IDN hostname with RTL label are the same thing. As a result, the Bidi rule should be applied here.

Could you please share your thoughts on this considering all the above?

There is also one interesting thing: if we take a look at Bidi criteria in point 4 we will see that mix of AN and EN in RTL label is not allowed

In an RTL label, if an EN is present, no AN may be present, and vice versa.

This would make a mix of Arabic-Indic digits and Extended Arabic-Indic digits invalid in any case because Extended Arabic-Indic digits have type EN. Considering this, the string will still fail the Bidi rule even after adding the first character with type AL. But at least it would violate the 4th point of Bidi criteria

gregsdennis commented 7 months ago

Okay. Thanks for the explanation. I just didn't know about these kinds of language issues, and it appeared like it might just be a language thing.

OptimumCode commented 7 months ago

Hi, could you please advise what should I do with PR?

If you disagree with some points in my explanation please let me know - I will try to elaborate.

Also, what do you think about adding test cases for Bidi rule violations? (as a separate PR)

Julian commented 7 months ago

It will probably be a few days until I can reread the spec but haven't forgotten.

OptimumCode commented 7 months ago

Thank you, @Julian. If you need something from my side please let me know

Julian commented 7 months ago

In our case, it contains characters with the types AN (Arabic Number) and UN (European Number). This would mean that it is an RTL label.

Took another quick look today -- yes I think you look correct, the other four examples in #675 were not RTL I think (they weren't R, AL or AN as you quoted) but this one is, so yeah I agree this one is indeed a bidi label and needs to follow the rule.

OptimumCode commented 7 months ago

Thank you, @Julian for reviewing this