openlink / virtuoso-opensource

Incorrect IRI handling in fct when doing content negotiation #1058

Open jakubklimek opened 2 years ago

jakubklimek commented 2 years ago

I use IRIs in my RDF and fct to browse and do content negotiation, e.g. for text/turtle. However, Unicode characters are handled inconsistently in the Location header of the HTTP redirect. When I run:

curl -i -H "Accept: text/turtle" https://linked.opendata.cz/resource/knowledge-graph-browser/view/uk/nadřazená-pracoviště

I get:

location: https://linked.opendata.cz/sparql?query=define%20sql%3Adescribe-mode%20%22CBD%22%20%20DESCRIBE%20%3Chttps%3A%2F%2Flinked.opendata.cz%2Fresource%2Fknowledge-graph-browser%2Fview%2Fuk%2Fnad%C5%99azená-pracovi%C5%A1t%C4%9B%3E&format=text%2Fturtle

Note the á there: all the other Unicode characters are percent-encoded, but á is not. This causes problems with libraries that expect an ASCII string, such as those implementing the fetch API, e.g. https://www.npmjs.com/package/node-fetch
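Until the header is fixed on the server side, a client could normalize the Location value itself before following the redirect. The following is only a minimal sketch of that idea, not a fix for Virtuoso: the helper name `to_ascii_location` is hypothetical, and it assumes the stray characters should be percent-encoded as UTF-8, like the ones that are already escaped.

```python
from urllib.parse import quote

# URL delimiters plus "%" are left alone, so escapes that are already
# present (e.g. %C5%99) are not encoded a second time.
_SAFE = "%:/?#[]@!$&'()*+,;="


def to_ascii_location(location: str) -> str:
    """Percent-encode any remaining non-ASCII characters as UTF-8 (hypothetical helper)."""
    return quote(location, safe=_SAFE)


# Location value copied from the redirect above (with the raw "á").
raw = (
    "https://linked.opendata.cz/sparql?query=define%20sql%3Adescribe-mode%20"
    "%22CBD%22%20%20DESCRIBE%20%3Chttps%3A%2F%2Flinked.opendata.cz%2Fresource"
    "%2Fknowledge-graph-browser%2Fview%2Fuk%2Fnad%C5%99azená-pracovi%C5%A1t"
    "%C4%9B%3E&format=text%2Fturtle"
)

fixed = to_ascii_location(raw)
print(raw.isascii(), fixed.isascii())
```

With this, the raw á becomes %C3%A1 while the rest of the URL stays unchanged, so the result is pure ASCII and applying the helper twice has no further effect.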

There is an nginx reverse proxy in front of Virtuoso, configured like this:

location /resource/ {
    include hsts-cors.conf;
    proxy_pass http://127.0.0.1:8890/describe/?url=https://linked.opendata.cz$uri;
    proxy_set_header   Host             $host;
    proxy_set_header   X-Real-IP        $remote_addr;
    proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
    proxy_pass_request_headers      on;
    proxy_redirect http://linked.opendata.cz https://linked.opendata.cz;

    sub_filter_once off;
    sub_filter 'href="http://linked.opendata.cz' 'href="https://linked.opendata.cz';
    sub_filter 'src="http://linked.opendata.cz' 'src="https://linked.opendata.cz';
}

When I tunneled to the server directly to bypass the proxy, the result was even worse:

curl -i -H "Accept: text/turtle" "http://localhost:8890/describe/?url=https://linked.opendata.cz/resource/knowledge-graph-browser/view/uk/nadřazená-pracoviště"
HTTP/1.1 303 See Other
Server: Virtuoso/07.20.3233 (Linux) x86_64-pc-linux-gnu
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
Date: Thu, 14 Jul 2022 06:45:56 GMT
Accept-Ranges: bytes
TCN: choice
Vary: negotiate,accept
Location: http://localhost:8890/sparql?query=define%20sql%3Adescribe-mode%20%22CBD%22%20%20DESCRIBE%20%3Chttps%3A%2F%2Flinked.opendata.cz%2Fresource%2Fknowledge-graph-browser%2Fview%2Fuk%2Fnadrazen%3F-pracovi%3Fte%3E&format=text%2Fturtle

URL-decoded, the IRI here is https://linked.opendata.cz/resource/knowledge-graph-browser/view/uk/nadrazen?-pracovi?te (ř becomes r, ě becomes e, and á and š are replaced by ?).
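The character-level damage can be checked mechanically by decoding the Location value from the direct request and comparing it with the IRI that was requested. This is just a sketch using the Python standard library; the variable names are for illustration only.

```python
from urllib.parse import parse_qs, urlsplit

# Location header returned by the direct request to port 8890.
location = (
    "http://localhost:8890/sparql?query=define%20sql%3Adescribe-mode%20"
    "%22CBD%22%20%20DESCRIBE%20%3Chttps%3A%2F%2Flinked.opendata.cz%2Fresource"
    "%2Fknowledge-graph-browser%2Fview%2Fuk%2Fnadrazen%3F-pracovi%3Fte%3E"
    "&format=text%2Fturtle"
)
# IRI that was actually requested.
expected = (
    "https://linked.opendata.cz/resource/knowledge-graph-browser/view/uk/"
    "nadřazená-pracoviště"
)

# Decode the "query" parameter and take the IRI between "<" and ">".
sparql = parse_qs(urlsplit(location).query)["query"][0]
iri = sparql[sparql.index("<") + 1 : sparql.rindex(">")]

# Both strings have the same length, so a pairwise comparison is enough.
diffs = [(want, got) for want, got in zip(expected, iri) if want != got]
print(iri)
print(diffs)
```

The comparison reports four changed characters: ř to r, á to ?, š to ? and ě to e.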

pkleef commented 2 years ago

Thanks for the report. We will have a look at what is going on there.