lanthaler / JsonLD

JSON-LD processor for PHP
MIT License
336 stars 61 forks source link

NQuads.php: made blank node regex more flexible #104

Closed k00ni closed 2 years ago

k00ni commented 2 years ago

As far as I understand https://www.w3.org/TR/n-quads/#BNodes, blank node labels in NQuads can include -, . and _ too. Specification requires these characters not being at the beginning of the label, but this regex is not structured to check that anyway.

lanthaler commented 2 years ago

The current regex is too restrictive as you point out. The proposed new one is too liberal. Would you mind updating it to match the spec?

k00ni commented 2 years ago

I updated the regex and added a few tests to prove they work. Is that approach OK?

Any idea how to handle the following characters? U+00B7, U+0300 to U+036F and U+203F to U+2040

lanthaler commented 2 years ago

You should be able to match then with \x{00B7} etc. If I read the regex in the change correctly, it would disallow IDs such as _:b as it always requires at least two characters. You'll have to make the second oder optional.

k00ni commented 2 years ago

_:b is covered now.

To include U+00B7, U+0300 to U+036F and U+203F to U+2040, I had to use UTF-8 mode in regex. This lead to failing tests because at Processor.php line 2895 it tries to use $property, which is an empty string sometimes. I have no idea why that is.

1) ML\JsonLD\Test\NQuadsTest::testEscaping
Error: Cannot access empty property

/var/www/html/Processor.php:2895
/var/www/html/Processor.php:2322
/var/www/html/JsonLD.php:465
/var/www/html/Test/NQuadsTest.php:45
phpvfscomposer:///var/www/html/vendor/phpunit/phpunit/phpunit:51

Without using UTF-8 mode in regex I get the following error:

preg_match(): Compilation failed: character value in \x{} or \o{} is too large at offset 76

We have two options as far as I can tell:

  1. Either fix this "behavior" which leads to this error in Processor.php and keep UTF-8 characters.
  2. Or don't include UTF-8 characters in regex.

WDYT?

k00ni commented 2 years ago

How do we proceed here @lanthaler?

lanthaler commented 2 years ago

Thanks for the ping, I missed your previous comment. I'd suggest we remove the UTF-8 characters for now. We can split that of into a separate bug.

k00ni commented 2 years ago

I removed my UTF-8 extensions.

After running PHPUnit (both PHP 5.6 and 7.4), I ran into a couple of errors and failures, such as:

1) ML\JsonLD\Test\W3CTestSuiteTest::testRemoteDocumentLoading with data set "http://json-ld.org/test-suite/tests/remote-doc-manifest.jsonld#t0012" ('Multiple context link headers', stdClass Object (...), stdClass Object (...))
Failed asserting that 'loading document failed' is equal to expected exception code 'multiple context link headers'.

phpvfscomposer:///var/www/html/vendor/phpunit/phpunit/phpunit:51

or

10) ML\JsonLD\Test\W3CTestSuiteTest::testRemoteDocumentLoading with data set "http://json-ld.org/test-suite/tests/remote-doc-manifest.jsonld#t0011" ('load JSON document with exten...h link', stdClass Object (...), stdClass Object (...))
ML\JsonLD\Exception\JsonLdException: Unable to load the remote document "https://json-ld.org:443/test-suite/tests/remote-doc-0011-in.jldt".

/var/www/html/FileGetContentsLoader.php:62
/var/www/html/JsonLD.php:142
/var/www/html/Test/W3CTestSuiteTest.php:182
phpvfscomposer:///var/www/html/vendor/phpunit/phpunit/phpunit:51

My changes are not the source of these errors because they persisted after reverting ALL of my changes. Unfortunately, there is no CI for pull requests to have a neutral view on the matter. I assume there were changes in http://json-ld.org/test-suite/tests/remote-doc-manifest.jsonld in the meantime which broke your tests.

I leave it to you how to proceed.

lanthaler commented 2 years ago

Oh, I didn't realize that CI is broken. Travis was running on PR before. Your code LGTM. I'll have a look at the tests and the CI setup (likely on the weekend) and merge your code.

Thanks a lot for your work on this Konrad.

k00ni commented 2 years ago

[...] and the CI setup (likely on the weekend)

I recommend to use Github Actions instead of Travis in the future.

In the past I had similar problems like these guys (ref, ref2). Basically, Travis was not reliable anymore.

lanthaler commented 2 years ago

Could you please allow edits to this PR? I've found the bug and would like to push it back to the PR before merging it to have the full history here

Btw. I've also set up CI using GitHub Actions and added a devcontainer for easier development

k00ni commented 2 years ago

Could you please allow edits to this PR?

It is allowed already. "Allow edits by maintainers" is checked.

Bildschirmfoto von 2022-09-29 08-30-26

k00ni commented 2 years ago

Btw. you mean the bug which stopped us from using UTF-8 symbols in the regex?

lanthaler commented 2 years ago

No, the bug that caused tests to fail. I didn't look into the UTF-8 symbols yet.

Thanks for working on this!