Closed pietercolpaert closed 7 years ago
Note that the regular expressions inside N3.js are tailored to the JavaScript implementation of Unicode. They represent some Unicode code points as two code units (a surrogate pair) instead of one (https://mathiasbynens.be/notes/javascript-unicode).
I don't know how PHP deals with this, so the expressions might have to be different. It's things like [\ud800-\udb7f][\udc00-\udfff] (representing the Turtle spec's [#x10000-#xEFFFF]) you should be on the lookout for.
Maybe try a multibyte regex? http://php.net/manual/en/function.mb-ereg.php
Hi, the current prefixed regex doesn't work because it attempts to match octet by octet: without the "u" modifier, the string is evaluated in its encoded form, e.g. UTF-8. Therefore, code-point-based patterns do not match.
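The octet-by-octet behaviour can be seen on the name from the original report. This is only a minimal sketch: it assumes the file is saved as UTF-8, and the two reduced patterns are made up for illustration, not taken from the lexer.

```php
<?php
// Assumes this file is saved as UTF-8; the reduced patterns are illustrative only.
$name = "c:テスト";

// Without "u", \xd8-\xf6 is a range of single octets. The lead byte 0xE3 of テ
// falls inside it, but the continuation byte 0x83 does not, so the match fails.
$byteResult = preg_match('/^[a-z]+:[A-Za-z\xc0-\xd6\xd8-\xf6]+$/', $name);

// With "u", pattern and subject are both read as UTF-8 code points.
$codePointResult = preg_match('/^[a-z]+:[A-Za-z\x{3001}-\x{D7FF}]+$/u', $name, $m);

var_dump(strlen($name));      // octet count: 2 ASCII bytes + 3 x 3 bytes
var_dump($byteResult);        // 0 = no match
var_dump($codePointResult);   // 1 = match
```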
To match on Unicode characters, the pattern needs the "u" switch. The "old version" would work if you:
1. change every \uHHHH to \\x{HHHH},
2. add the "u" switch (/.../u),
3. remove the [\ud800-\udb7f][\udc00-\udfff] subpatterns (these are not part of the Turtle grammar),
and replace all strlen() and substr() calls with mb_strlen() and mb_substr() respectively, to count characters rather than octets (some of them could stay non-mb_ versions).
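The regex changes described above could be sketched as a small helper. This is only an illustration: jsPatternToPcre() is a hypothetical name, not something in hardf, and it assumes the surrogate subpattern appears as a "|" alternative in exactly this spelling and that the pattern body contains no unescaped "/".

```php
<?php
/**
 * Hypothetical helper: port a JavaScript-style pattern body to a PCRE pattern.
 */
function jsPatternToPcre(string $jsBody): string
{
    // Drop the surrogate-pair alternative first; it only exists because
    // JavaScript strings are UTF-16, and \x{D800} is not valid in "u" mode.
    $body = str_replace('|[\ud800-\udb7f][\udc00-\udfff]', '', $jsBody);

    // Rewrite every \uHHHH escape as \x{HHHH}.
    $body = preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})/',
        function (array $m): string {
            return '\\x{' . $m[1] . '}';
        },
        $body
    );

    // Wrap in delimiters and add the "u" modifier.
    return '/' . $body . '/u';
}

// Reduced example body, not the full N3.js expression.
$js = '^(?:[A-Za-z\u00c0-\u00d6\u3001-\ud7ff]|[\ud800-\udb7f][\udc00-\udfff])+$';
$pcre = jsPatternToPcre($js);
echo $pcre, "\n";
var_dump(preg_match($pcre, "テスト"));
```

Code points above U+FFFF can then be added back inside the character class as \x{10000}-\x{EFFFF}, as in the grammar-based definitions.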
I also tried to define a regex pattern based on the grammar in section 6.5 of the Turtle spec.
// Turtle grammar terminals (spec section 6.5), meant to be used with the "u" modifier.
define("HEX", "[0-9A-Fa-f]");
define("PERCENT", "%". HEX .HEX);
// "\\\\" yields a regex-escaped backslash, so this matches "\" followed by one escapable character.
define("PN_LOCAL_ESC", "\\\\[_~.\\-!\$&'()*+,;=\\/?#@%]");
define("PLX", PERCENT ."|". PN_LOCAL_ESC);
define("PN_CHARS_BASE", "A-Za-z\\x{00C0}-\\x{00D6}\\x{00D8}-\\x{00F6}\\x{00F8}-\\x{02FF}\\x{0370}-\\x{037D}\\x{037F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}");
define("PN_CHARS_U", PN_CHARS_BASE. '_');
define("PN_CHARS", PN_CHARS_U. "\-0-9\\x{00B7}\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}");
define("PN_PREFIX", "[".PN_CHARS_BASE. "](?:[".PN_CHARS ."\.]*[". PN_CHARS."])?");
// The optional tail must end in a non-"." character, as the grammar requires.
define("PN_LOCAL", "(?:[".PN_CHARS_U .":0-9]|". PLX .")(?:(?:[".PN_CHARS ."\.:]|". PLX.")*(?:[".PN_CHARS .":]|". PLX."))?");
define("PNAME_NS", "(".PN_PREFIX.")?:");
define("PNAME_LN", PNAME_NS . "(".PN_LOCAL.")");
define("PREFIXED_NAME_RE", "/^(?:".PNAME_LN ."|". PNAME_NS.")/u");
A little bit verbose, but might be easier to understand.
I experimented with putting these definitions after the namespace declaration and replacing $this->prefixed with PREFIXED_NAME_RE in the preg_match inside the if($inconclusive) conditional, and then it worked.
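One thing that may be worth checking when swapping in a pattern of this shape: because it has two alternatives, the prefix lands in a different capture group when only the namespace part matches. The sketch below uses reduced stand-ins for PN_PREFIX and PN_LOCAL ($prefix and $local are hypothetical, far smaller than the real grammar) and assumes a UTF-8 source file.

```php
<?php
// Reduced stand-ins for PN_PREFIX and PN_LOCAL (hypothetical, for illustration only).
$prefix = '[A-Za-z]+';
$local  = '[A-Za-z\x{3001}-\x{D7FF}]+';

// Same shape as PREFIXED_NAME_RE: PNAME_LN | PNAME_NS.
$re = "/^(?:($prefix)?:($local)|($prefix)?:)/u";

// Full prefixed name: the first alternative matches, groups 1 and 2 are filled.
preg_match($re, "c:テスト", $full);
var_dump($full[1], $full[2]);

// Namespace only: the second alternative matches, so the prefix is in group 3
// and groups 1 and 2 come back as empty strings.
preg_match($re, "c: ", $ns);
var_dump($ns[1], $ns[2], $ns[3]);
```

Code that reads the prefix from group 1 would therefore see an empty string in the namespace-only case, unless the groups are merged in some way.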
@mkanzaki Your 3 steps work!
Could you however come up with a test that would fail when not using mb_strlen?
Hmm, plain strlen() and substr() appear to work. While I tried to find problems by changing some parts, a few cases were fixed by mb_strlen(), but they probably just happened to pass and are not fundamental issues. Sorry.
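A likely reason the plain functions keep working: preg_match() returns ordinary byte strings even with the "u" modifier, so measuring a match with strlen() and cutting with substr() stays consistent as long as both count octets. A minimal sketch, assuming a UTF-8 source file and the mbstring extension; the reduced pattern and the variable names are illustrative only.

```php
<?php
// Assumes this file is saved as UTF-8 and that mbstring is available.
$input = "c:テスト a";

preg_match('/^[a-z]+:[\x{3001}-\x{D7FF}]+/u', $input, $match);
$token = $match[0];

$byteLength = strlen($token);               // length in octets
$charLength = mb_strlen($token, 'UTF-8');   // length in characters

// Consistent units: both pairs leave the same remainder.
$restBytes = substr($input, $byteLength);
$restChars = mb_substr($input, $charLength, null, 'UTF-8');

// Mixed units: a character count used as a byte offset cuts inside the token.
$restMixed = substr($input, $charLength);

var_dump($byteLength, $charLength, $restBytes === $restChars, $restMixed);
```

So a test that fails without mb_strlen() would need code that mixes the two units, not code that uses strlen() and substr() throughout.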
I guess no test for mb_strlen needed now.
Thank you very much for putting the time and effort to check all this! Version 0.1.1 is now released with unicode support in prefixed names.
Our current prefixed regex
/^((?:[A-Za-z\xc0-\xd6\xd8-\xf6])(?:\.?[\-0-9A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6])*)?:((?:(?:[0-:A-Z_a-z\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~])(?:(?:[\.\-0-:A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~])*(?:[\-0-:A-Z_a-z\xb7\xc0-\xd6\xd8-\xf6]|%[0-9a-fA-F]{2}|\\[!#-\/;=?\-@_~]))?)?)(?:[ \t]+|(?=\.?[,;!\^\s#()\[\]\{\}"'<]))/
does not match valid TriG entities like c:テスト.
This prefixed regex is defined in the N3 lexer on line 68: https://github.com/pietercolpaert/hardf/blob/master/src/N3Lexer.php#L68
The reason I had to simplify the regex is that PHP does not allow \uHHHH-style Unicode escape sequences in PCRE regular expressions... Are there any alternatives?
Original issue was found by Kanzaki Masahide:
I found that TriGParser fails to handle non-ASCII prefixed names, e.g.
While it's OK to parse IRI :
Note N3.js can parse both properly.