Content-encoding SPARQL query (België) #772

Open coret opened 1 year ago

coret commented 1 year ago

When searching for België in the GTAA, no results are returned, whereas searching for Belgie returns België among other results.

Testing by @wmelder showed the following:

The query for België, using the construct_gtaa.rq query file and run via

    curl -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}'

yields no results, but

    curl -H "Content-type: application/x-www-form-urlencoded; charset=utf-8" -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}'

does give results!

It seems the Comunica client (Network of Terms) sends UTF-8 but doesn't include a charset in the Content-Type header, so the server falls back to a non-UTF-8 default (US-ASCII or ISO-8859-1).
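To make the suspected mechanism concrete, here is a minimal Python sketch (not taken from Comunica or the GTAA server) of what happens when UTF-8 bytes are percent-encoded by the client and then decoded as ISO-8859-1 on the server. That the server really decodes this way is an assumption based on the test above.

```python
from urllib.parse import quote, unquote_to_bytes

term = "België"

# What the client puts on the wire: the UTF-8 bytes of the term, percent-encoded.
encoded = quote(term, safe="")

# The raw bytes the server holds after percent-decoding the form field.
raw = unquote_to_bytes(encoded)

# A server that assumes ISO-8859-1 turns the two UTF-8 bytes of "ë" into two characters.
as_latin1 = raw.decode("iso-8859-1")

# A server that is told (or assumes) UTF-8 recovers the original term.
as_utf8 = raw.decode("utf-8")

print(encoded)    # Belgi%C3%AB
print(as_latin1)  # BelgiÃ«
print(as_utf8)    # België
```

A full-text search for the mis-decoded string would find nothing, which matches the empty result for België.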

Should (or can) the charset be part of the GTAA dataset description within the Network of Terms (a client-side solution)? Or should a default charset (utf-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?

Some other searches that have problems with terms containing diacritics: Ampèrestraat (Adamlink) and Curaçaostraat (Gouda Tijdmachine). I haven't checked whether adding a charset helps with these sources.

Some other searches that do not have a problem with terms containing diacritics: Eichstätt (WO2 thesaurus), Galileïsche (AAT), Henriëtte (RKDartists).

wmelder commented 1 year ago

Adding a hardcoded charset to the HTTP header would suffice. Otherwise, the receiving server doesn't know which encoding is being sent.

wmelder commented 1 year ago

On second thoughts... what if the server doesn't handle the charset properly, or doesn't default to UTF-8? Then it would be nice if the Network of Terms could send a charset that the server does handle properly. In those cases a dataset parameter would be necessary.
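The "hardcoded default with a per-dataset override" idea could look roughly like the Python sketch below. The names `DEFAULT_CHARSET`, `form_post_headers`, `encode_form_body` and the `dataset_charset` parameter are hypothetical, made up for illustration; they are not part of the Network of Terms or Comunica.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed hardcoded default, used when a dataset does not specify a charset.
DEFAULT_CHARSET = "utf-8"


def form_post_headers(dataset_charset: Optional[str] = None) -> dict:
    """Build request headers that state the charset of the form body explicitly."""
    charset = dataset_charset or DEFAULT_CHARSET
    return {
        "Content-Type": f"application/x-www-form-urlencoded; charset={charset}",
        "Accept": "text/turtle",
    }


def encode_form_body(query: str, dataset_charset: Optional[str] = None) -> bytes:
    """Percent-encode the query using the same charset that the header announces."""
    charset = dataset_charset or DEFAULT_CHARSET
    return urlencode({"query": query}, encoding=charset).encode("ascii")


print(form_post_headers())
print(encode_form_body("België"))                # b'query=Belgi%C3%AB'
print(encode_form_body("België", "iso-8859-1"))  # b'query=Belgi%EB'
```

The point of the sketch is that the header and the body encoding are derived from the same setting, so an override for a server that cannot handle UTF-8 changes both together.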

ddeboer commented 1 year ago

What is construct_gtaa.rq and where can I find it?

wmelder commented 1 year ago

@ddeboer construct_gtaa.rq is basically the gtaa.rq query, but it may include VALUES for query and datasetUri, the variables that are normally filled in by the Network of Terms. We renamed it so that we could use it as a test query file. Nothing very exciting in itself.

wmelder commented 1 year ago

Currently these are the contents of the file:

PREFIX skos: <>
PREFIX justskos: <>
PREFIX text: <>

    ?uri a skos:Concept ;
        skos:prefLabel ?prefLabel ;
        skos:altLabel ?altLabel ;
        skos:hiddenLabel ?hiddenLabel ;
        skos:scopeNote ?scopeNote ;
        skos:broader ?broader_uri ;
        skos:narrower ?narrower_uri ;
        skos:related ?related_uri .
    ?broader_uri skos:prefLabel ?broader_prefLabel .
    ?narrower_uri skos:prefLabel ?narrower_prefLabel .
    ?related_uri skos:prefLabel ?related_prefLabel .
    VALUES ?query { "zelensky" }
    VALUES ?datasetUri {
    ?uri text:query (skos:prefLabel skos:altLabel skos:hiddenLabel ?query) .
    ?uri skos:inScheme ?datasetUri ;
        justskos:status ?status .
    FILTER(?status IN ('approved', 'candidate'))

        ?uri skos:prefLabel ?prefLabel .
        FILTER(LANG(?prefLabel) = "nl" )
        ?uri skos:altLabel ?altLabel .
        FILTER(LANG(?altLabel) = "nl")
        ?uri skos:hiddenLabel ?hiddenLabel .
        FILTER(LANG(?hiddenLabel) = "nl")
        ?uri skos:scopeNote ?scopeNote .
        FILTER(LANG(?scopeNote) = "nl")
        ?uri skos:broader ?broader_uri .
        ?broader_uri skos:prefLabel ?broader_prefLabel .
        FILTER(LANG(?broader_prefLabel) = "nl")
        ?uri skos:narrower ?narrower_uri .
        ?narrower_uri skos:prefLabel ?narrower_prefLabel .
        FILTER(LANG(?narrower_prefLabel) = "nl")
        ?uri skos:related ?related_uri .
        ?related_uri skos:prefLabel ?related_prefLabel .
        FILTER(LANG(?related_prefLabel) = "nl")
LIMIT 1000

wmelder commented 1 year ago

For this issue it should be modified a bit:

    VALUES ?query { "België" }
    VALUES ?datasetUri {

ddeboer commented 1 year ago

For previous work on diacritics, see … and …. At least for Virtuoso sources (Adamlink), how diacritics are interpreted is out of our control.

wmelder commented 1 year ago

The SPARQL documentation states that a POST with application/sparql-query is always in UTF-8. For a POST with x-www-form-urlencoded it does not say so. It may be better to use the application/sparql-query variant (with unescaped UTF-8, that is).

A tip from our developers...
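As an illustration of that variant, the following Python sketch builds (but does not send) a direct POST with the query as an unescaped UTF-8 body. The endpoint URL and the example query are placeholders, and `build_direct_post` is a hypothetical helper, not an existing Network of Terms function.

```python
import urllib.request


def build_direct_post(endpoint: str, sparql: str) -> urllib.request.Request:
    """Prepare a SPARQL query as a direct POST: raw UTF-8 body, no form encoding."""
    return urllib.request.Request(
        endpoint,
        data=sparql.encode("utf-8"),  # the body is the query itself, as UTF-8 bytes
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "text/turtle",
        },
        method="POST",
    )


# Placeholder endpoint and query, for illustration only.
req = build_direct_post(
    "https://example.org/sparql",
    'SELECT * WHERE { ?s ?p "België"@nl } LIMIT 1',
)

print(req.get_method())
print(req.get_header("Content-type"))
print(req.data)
```

Because the body is not percent-encoded, there is no separate form charset for the server to guess; passing `req` to `urllib.request.urlopen` would send it.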