openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

Error in handling of Unicode characters with SPARQL CONCAT function #944

Closed jakubklimek closed 4 months ago

jakubklimek commented 3 years ago

There is an issue with the handling of Unicode characters when the SPARQL CONCAT and ENCODE_FOR_URI functions are combined.

When used like this: BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) as ?c), the resulting literal is https://c/\u00E9/\u00C3\u0081%00 which, when decoded, is https://c/é/Ã (Ã plus the invisible control character U+0081), which is wrong.

When used like this (note that in the first string in CONCAT, I replace é with e): BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) as ?d), the resulting literal is https://d/e/\u00C1%00, which, when decoded, is https://d/e/Á, which is correct. I am not sure whether the problem is in the CONCAT or the ENCODE_FOR_URI function.
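The escaped literal in the first case is consistent with the UTF-8 bytes of Á being re-read one byte per character, as if they were Latin-1. A minimal Python sketch of that reading follows; it is an assumption about the cause that only reproduces the symptom, not a confirmed diagnosis of Virtuoso's internals.

```python
# Assumption: somewhere in the CONCAT path the UTF-8 bytes of "Á" are
# misread as Latin-1. This reproduces the symptom; it is not Virtuoso code.
correct = "Á"                            # U+00C1
utf8_bytes = correct.encode("utf-8")     # b'\xc3\x81'
misread = utf8_bytes.decode("latin-1")   # 'Ã' (U+00C3) followed by U+0081

# Render the misread code points the way the ?c literal showed them.
escaped = "".join(f"\\u{ord(ch):04X}" for ch in misread)
print(escaped)  # \u00C3\u0081
```

The two code points match the \u00C3\u0081 in the ?c literal, whereas the ?d literal carries the single expected code point \u00C1.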

This query can be run on https://data.gov.cz/sparql or https://dev.nkod.opendata.cz/sparql:

PREFIX dcat: <http://www.w3.org/ns/dcat#>

CONSTRUCT {
  ?a ?b ?c, ?d .
}
WHERE {
  ?test a dcat:Dataset .
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) as ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) as ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) as ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) as ?d)
}

Note that ?test a dcat:Dataset . is not necessary and can be replaced by anything which matches something in the graph. It could be omitted, but that triggers this 6.5-year-old issue: https://github.com/openlink/virtuoso-opensource/issues/231 when run directly on the Virtuoso SPARQL Endpoint.

When run in Yasgui (https://api.triplydb.com/s/8E0WDV550), it works even without this.

TallTed commented 3 years ago

@jakubklimek -- Note that you can drop the ?test a dcat:Dataset . pattern and run the query against both of your listed endpoints, if you either un-tick the box for "Strict checking of void variables" on the SPARQL query form (as noted in the comments on #231) or insert define sql:signal-void-variables 0 before CONSTRUCT in your query. (The define option also works through saved URLs, as on data.gov.cz (query, results) or dev.nkod.opendata.cz (query, results).)
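For anyone scripting the test rather than using the query form, here is a rough sketch of building such a saved URL in Python. The helper name `build_request_url` is hypothetical; the endpoint and the define line come from this thread, and `query` is the standard SPARQL Protocol parameter name. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Endpoint from the issue; "query" is the standard SPARQL Protocol parameter.
ENDPOINT = "https://data.gov.cz/sparql"

# Virtuoso-specific pragma quoted in the comment above.
DEFINE = "define sql:signal-void-variables 0"

# The reporter's CONSTRUCT with the placeholder triple pattern dropped.
QUERY = """CONSTRUCT {
  ?a ?b ?c, ?d .
}
WHERE {
  BIND(IRI(CONCAT("https://a/é/", ENCODE_FOR_URI("Á"))) AS ?a)
  BIND(IRI(CONCAT("https://b/e/", ENCODE_FOR_URI("Á"))) AS ?b)
  BIND(CONCAT("https://c/é/", ENCODE_FOR_URI("Á")) AS ?c)
  BIND(CONCAT("https://d/e/", ENCODE_FOR_URI("Á")) AS ?d)
}"""


def build_request_url(endpoint: str, query: str) -> str:
    """Hypothetical helper: prepend the define and URL-encode the query as UTF-8."""
    return endpoint + "?" + urlencode({"query": DEFINE + "\n" + query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Because `urlencode` percent-encodes the text as UTF-8, é and Á reach the server as %C3%A9 and %C3%81, so any damage seen in the result is not introduced by the client.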

Is there a reason you're using a CONSTRUCT query to test, instead of a SELECT? (At a quick glance, the encoding issue appears to happen in both; I just want to be sure I'm not missing something.)

@smalinin @pkleef @iv-an-ru -- Please take a look at this.

jakubklimek commented 3 years ago

@TallTed thanks, I knew there was a workaround for this somewhere.

I used CONSTRUCT just because that is how I discovered the bug and went on minimizing the example, no other reason.

jakubklimek commented 3 years ago

I ran into this issue even without ENCODE_FOR_URI. It therefore seems to be confined to CONCAT. Whenever a non-ASCII Unicode character is used in CONCAT, the result is badly encoded:

PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("ě", ?dataset) AS ?changed)
}
LIMIT 1

This produces ěhttps://data.gov.cz/zdroj/datovÃ©-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff, while

PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT ?changed WHERE {
  ?dataset a dcat:Dataset .
  BIND(CONCAT("e", ?dataset) AS ?changed)
}
LIMIT 1

produces ehttps://data.gov.cz/zdroj/datové-sady/https---isdv.upv.cz-opendata-upv-package_show-id-vz20210307diff

(note the first character, and then datovÃ©-sady vs datové-sady)
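When re-testing a new build, a result string can be scanned for this kind of damage automatically. The sketch below is a heuristic and rests on the assumption that the corruption is UTF-8 bytes misread as Latin-1. It looks for a UTF-8 lead byte followed by continuation bytes, each appearing as a separate character, and reports what they would decode to. `find_misread_utf8` is a hypothetical helper name.

```python
import re

# A UTF-8 lead byte (0xC2-0xF4) followed by 1-3 continuation bytes (0x80-0xBF),
# each showing up as its own Latin-1 character in the damaged string.
_SUSPECT = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]{1,3}")


def find_misread_utf8(text: str) -> list[tuple[str, str]]:
    """Return (damaged, repaired) pairs for runs that decode as valid UTF-8."""
    hits = []
    for match in _SUSPECT.finditer(text):
        raw = match.group(0)
        try:
            repaired = raw.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            continue  # not a complete UTF-8 sequence; leave it alone
        hits.append((raw, repaired))
    return hits


# Shortened versions of the two results reported above.
print(find_misread_utf8("ěhttps://data.gov.cz/zdroj/datovÃ©-sady/"))  # [('Ã©', 'é')]
print(find_misread_utf8("ehttps://data.gov.cz/zdroj/datové-sady/"))   # []
```

The leading ě is left alone because it is outside the Latin-1 range, which fits the mixed result seen here: the literal passed to CONCAT survives while the appended value is damaged.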

jakubklimek commented 3 years ago

Still happening in https://github.com/openlink/virtuoso-opensource/commit/8baf8a90afc842c52b7d2f44af0ca99c88d85b68

jakubklimek commented 3 years ago

Still happening in a7b01eced76532f1fa36fdf665f9f836531bdae0

TallTed commented 3 years ago

@smalinin @pkleef @iv-an-ru @hughwilliams @openlink -- Any estimate of when this will be investigated, if not resolved? It seems likely to be causing trouble if not blocking a good number of deployments where Unicode is in broader use.

jakubklimek commented 2 years ago

@pkleef any chance of looking into this when you are dealing with Unicode-related issues? :)

jakubklimek commented 4 months ago

Still happening in 99e4f122c5. @HughWilliams any chance of fixing this? It is a really annoying issue.

pkleef commented 4 months ago

We fixed this problem in commits 06ac26454d060339de4fab69a8ef3e27a4abc946 and c7f420a8dc6b1a437a1ef9a37f44ca8192f9786c. Please check out the latest develop/7 branch.

jakubklimek commented 4 months ago

@pkleef Thanks, seems to work fine now.

TallTed commented 4 months ago

Closing based on https://github.com/openlink/virtuoso-opensource/issues/944#issuecomment-2178724924