Unicode with prefix fails to pass

pietercolpaert commented 7 years ago

Our current prefixed regex /^((?:[A-Za-z\xc0-\xd6\xd8-\xf6])(?:\.?[\-0-9A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6])*)?:((?:(?:[0-:A-Z_a-z\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~])(?:(?:[\.\-0-:A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~])*(?:[\-0-:A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~]))?)?)(?:[ \t]+|(?=\.?[,;!\^\s#()\[\]\{\}"'<]))/ does not match valid TriG entities like c:テスト.

This prefixed regex is defined in the N3 lexer on line 68: https://github.com/pietercolpaert/hardf/blob/master/src/N3Lexer.php#L68

I had to simplify the regex because PHP's PCRE does not accept \uHHHH Unicode escape sequences in regular expressions. Are there any alternatives?

Original issue was found by Kanzaki Masahide:

I found that TriGParser fails to handle non-ASCII prefixed names, e.g.

@prefix c: <http://example.org/>.
c:test a c:テスト .

Parsing the equivalent full IRI works, however:

@prefix c: <http://example.org/>.
c:test a <http://example.org/テスト> .

Note N3.js can parse both properly.

RubenVerborgh commented 7 years ago

Note that the regular expressions inside N3.js are tailored to the JavaScript implementation of Unicode. JavaScript represents some Unicode code points as two code units (a surrogate pair) instead of one (https://mathiasbynens.be/notes/javascript-unicode).

I don't know how PHP deals with this, so the expressions may need to differ. It's things like [\ud800-\udb7f][\udc00-\udfff] (representing the Turtle spec's [#x10000-#xEFFFF]) that you should be on the lookout for.
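
For comparison, here is a minimal sketch of how PHP sees such a code point, assuming PHP 7+ (for the "\u{...}" string escape) and the "u" pattern modifier. In UTF-8 mode PCRE treats a supplementary-plane character as one code point, so the Turtle range can be written directly instead of as a surrogate pair. The sample character U+20000 is only an illustration.

```php
<?php
// U+20000 lies in the Turtle range [#x10000-#xEFFFF]; in UTF-8 it takes four octets.
$ch = "\u{20000}";

// strlen() counts octets, not characters.
$octets = strlen($ch);

// With the "u" modifier the range is expressed in code points.
$asCodePoint = preg_match('/^[\x{10000}-\x{EFFFF}]$/u', $ch);

// Without it, "." consumes a single octet, so one "." cannot cover the whole string.
$asOctets = preg_match('/^.$/s', $ch);
```

With the "u" modifier there is no PHP counterpart to the JavaScript surrogate-pair subpattern; the code-point range stands on its own.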

joetm commented 7 years ago

Maybe try a multibyte regex? http://php.net/manual/en/function.mb-ereg.php

mkanzaki commented 7 years ago

Hi, the current prefixed regex doesn't work because, without the "u" modifier, it matches octet by octet against the string in its encoded form (e.g. UTF-8). A pattern written in terms of code points therefore does not match.
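
The difference can be reproduced in a few lines. This is a minimal sketch, assuming the source file is saved as UTF-8; the character class in the last pattern is a small illustrative subset of the Turtle ranges, not the full grammar.

```php
<?php
$name = 'テスト'; // three characters, stored as nine UTF-8 octets

// strlen() counts octets; a "u"-mode pattern counts code points.
$octets = strlen($name);
$codePoints = preg_match_all('/./su', $name);

// Character class from the simplified regex, no "u" modifier:
// it is compared against single octets, and octets such as 0x83 fall outside it.
$simplified = preg_match('/^[A-Za-z\xc0-\xd6\xd8-\xf6]+$/', $name);

// Code-point ranges with the "u" modifier (illustrative subset only).
$unicode = preg_match('/^[A-Za-z\x{00C0}-\x{00D6}\x{3001}-\x{D7FF}]+$/u', $name);
```

Here $simplified is 0 and $unicode is 1, which matches the behaviour reported for c:テスト.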

To match on Unicode characters, the pattern needs the "u" modifier. The "old version" would work if you:

change \uHHHH to \\x{HHHH}
add "u" after the final "/" (i.e. pattern /.../u)
delete the [\ud800-\udb7f][\udc00-\udfff] subpatterns (surrogates are not part of the Turtle grammar)

and replace all strlen() and substr() calls with mb_strlen() and mb_substr() respectively, so that they count characters rather than octets (some of them could stay as the non-mb_ versions).
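
The three pattern steps can be sketched mechanically. The helper name jsPatternToPcre and the sample fragment are hypothetical, and the sketch assumes the surrogate-pair subpattern appears as an alternative introduced by "|"; a real pattern may need hand editing instead.

```php
<?php
// Hypothetical helper: turns the body of a JavaScript-style pattern into a PCRE pattern string.
function jsPatternToPcre(string $jsBody): string
{
    // Delete the surrogate-pair alternative first, so no \x{D800}-style escapes are produced.
    $body = str_replace('|[\ud800-\udb7f][\udc00-\udfff]', '', $jsBody);

    // Rewrite each \uHHHH escape as \x{HHHH}.
    $body = preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})/',
        function (array $m): string {
            return '\\x{' . $m[1] . '}';
        },
        $body
    );

    // Add the delimiters and the "u" modifier.
    return '/' . $body . '/u';
}

// Illustrative fragment only, not the full N3.js expression.
$fragment = '^(?:[A-Za-z\u00c0-\u00d6\u3001-\ud7ff]|[\ud800-\udb7f][\udc00-\udfff])+$';
$pattern = jsPatternToPcre($fragment);
$matches = preg_match($pattern, 'テスト');
```

The single-quoted strings keep the backslashes intact, so \x{HHHH} reaches PCRE unchanged; inside a double-quoted PHP string the same escape is written \\x{HHHH}.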

I also tried defining the regex pattern based on the grammar in section 6.5 of the Turtle spec.

// Token patterns following the Turtle grammar productions; compiled with the "u" modifier.
define("HEX", "[0-9A-Fa-f]");
define("PERCENT", "%". HEX .HEX);
// Four backslashes in PHP source give the regex \\, i.e. one literal backslash before the class.
define("PN_LOCAL_ESC", "\\\\[_~\.\-!\$&'\(\)\*\+,;=\/\?#@%]");
define("PLX", PERCENT ."|". PN_LOCAL_ESC);
define("PN_CHARS_BASE", "A-Za-z\\x{00C0}-\\x{00D6}\\x{00D8}-\\x{00F6}\\x{00F8}-\\x{02FF}\\x{0370}-\\x{037D}\\x{037F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}");
define("PN_CHARS_U", PN_CHARS_BASE. '_');
define("PN_CHARS", PN_CHARS_U. "\-0-9\\x{00B7}\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}");
define("PN_PREFIX", "[".PN_CHARS_BASE. "](?:[".PN_CHARS ."\.]*[". PN_CHARS."])?");
// The tail is optional as a whole, so a local name cannot end in ".".
define("PN_LOCAL", "(?:[".PN_CHARS_U .":0-9]|". PLX .")(?:(?:[".PN_CHARS ."\.:]|". PLX.")*(?:[".PN_CHARS .":]|". PLX."))?");
define("PNAME_NS", "(".PN_PREFIX.")?:");
define("PNAME_LN", PNAME_NS . "(".PN_LOCAL.")");
define("PREFIXED_NAME_RE", "/^(?:".PNAME_LN ."|". PNAME_NS.")/u");

A little bit verbose, but might be easier to understand.

As an experiment, I put these definitions after the namespace declaration and replaced $this->prefixed with PREFIXED_NAME_RE in the preg_match call inside the if ($inconclusive) conditional; with that change it worked.
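
To show the shape of such a match without repeating every range, here is a reduced stand-in for the pattern. The variable names and the helper matchPrefixedName are hypothetical, the ranges are a small subset of the grammar, and percent and backslash escapes are left out, so treat it as a sketch rather than a replacement for the lexer code.

```php
<?php
// Reduced stand-in for the prefixed-name pattern: same structure, far fewer ranges.
$base   = 'A-Za-z\x{00C0}-\x{00D6}\x{3001}-\x{D7FF}';
$chars  = $base . '_\-0-9';
$prefix = '[' . $base . '](?:[' . $chars . '.]*[' . $chars . '])?';
$local  = '[' . $base . '_:0-9](?:[' . $chars . '.:]*[' . $chars . ':])?';
$prefixedNameRe = '/^(' . $prefix . ')?:(' . $local . ')?/u';

// Hypothetical helper: returns [prefix, local name], or null when the input
// does not start with a prefixed name.
function matchPrefixedName(string $re, string $input): ?array
{
    if (preg_match($re, $input, $m) !== 1) {
        return null;
    }
    // Unmatched trailing groups are absent from $m, hence the defaults.
    return [$m[1] ?? '', $m[2] ?? ''];
}

$unicodeName = matchPrefixedName($prefixedNameRe, 'c:テスト .');
$dotted      = matchPrefixedName($prefixedNameRe, 'c:test.');
$fullIri     = matchPrefixedName($prefixedNameRe, '<http://example.org/テスト>');
```

In this sketch $unicodeName is ['c', 'テスト'], $dotted is ['c', 'test'] because the statement-ending dot stays outside the local name, and $fullIri is null because a full IRI is a different token.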

pietercolpaert commented 7 years ago

@mkanzaki Your 3 steps work!

Could you however come up with a test that would fail when not using mb_strlen?

mkanzaki commented 7 years ago

Hmm, plain strlen() and substr() appear to work. While I was looking for problems by changing some parts, a few cases were fixed by mb_strlen(), but they probably just happened to pass and were not fundamental issues. Sorry.

I guess no test for mb_strlen is needed for now.
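
One plausible reason the plain functions keep working, sketched under the assumption that the lexer consumes input by cutting off the length of the matched token (the actual lexer code may differ): preg_match returns the match as an ordinary byte string even with the "u" modifier, so strlen() and substr() agree with each other as long as both count octets.

```php
<?php
$input = 'c:テスト .'; // assumes the source file is saved as UTF-8

// Match a whitespace-delimited token in UTF-8 mode (illustrative pattern only).
preg_match('/^[^\s]+/u', $input, $m);
$token = $m[0];

// Both calls work in octets, so the cut lands exactly after the token.
$consumed = strlen($token);
$rest = substr($input, $consumed);
```

Mixing the two families would be the risky case: an octet count from strlen() passed to mb_substr() as a character offset would skip past the end of the token.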

pietercolpaert commented 7 years ago

Thank you very much for putting in the time and effort to check all this! Version 0.1.1 is now released with Unicode support in prefixed names.
