openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

ISQL CSV encoding is weird... #421

Open joernhees opened 9 years ago

joernhees commented 9 years ago

Is it just me or is the isql CSV encoding weird?

SQL> SET CSV=ON;
SQL> sparql select ?s ("hallo you; % jörn" as ?foo) ?p  where {?s ?p ?o} limit 1;
s;foo;p
http://www.openlinksw.com/virtrdf-data-formats#default-iid;hallo you%3B %25 j%FFF6rn;http://www.w3.org/1999/02/22-rdf-syntax-ns#type

1 Rows. -- 2 msec.

I guess that this is some kind of ASCII-%-URI-like encoding, but it's not very parseable, especially the "; % " becoming "%3B %25 " and the "ö" becoming "%FFF6".
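
For comparison, here is a minimal Python sketch of what standard percent-encoding of the same string looks like, assuming the text is encoded as UTF-8 first. Python and `urllib.parse` are used purely for illustration; nothing here is taken from isql itself, and the `isql_output` string is simply copied from the result row above.

```python
from urllib.parse import quote, unquote

text = "hallo you; % jörn"

# Standard percent-encoding of the UTF-8 bytes. Spaces are kept unescaped
# (safe=" ") only to mirror the isql output shown above.
encoded = quote(text, safe=" ")
print(encoded)  # hallo you%3B %25 j%C3%B6rn

# A standard encoding round-trips cleanly.
assert unquote(encoded) == text

# The isql output uses "%FFF6" for "ö". A standard decoder reads that as
# the escape "%FF" followed by the literal characters "F6", so the
# original string is not recovered.
isql_output = "hallo you%3B %25 j%FFF6rn"
print(unquote(isql_output) == text)  # False
```

So the ";" and "%" escapes are ordinary percent-encoding; only the non-ASCII character deviates from it.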

roelj commented 4 years ago

Digging up an old issue..

I am wondering how the encoding actually works. I have a value in which a single ö is turned into %FFC3%FFB6. The command I used is:

isql 1111 csv=on exec='SPARQL SELECT ?s ?o FROM <some-graph> WHERE { ?s rdfs:label ?o }'

Not using the csv=on produces the ö just fine:

isql 1111 exec='SPARQL SELECT ?s ?o FROM <some-graph> WHERE { ?s rdfs:label ?o }'

HughWilliams commented 4 years ago

The "isql" SET CSV_RFC4180 ON; command is the correct way to enable CSV results output:

SQL> SET CSV_RFC4180 ON;
SQL> SPARQL SELECT * WHERE {?s ?p ?o} LIMIT 2;
"s","p","o"
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://www.w3.org/1999/02/22-rdf-syntax-ns#type","http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"

2 Rows. -- 81 msec.
SQL>

roelj commented 4 years ago

That doesn't seem to affect the encoding:

/opt/sparqling-genomics/bin/isql 1111 verbose=off csv_rfc4180=on csv_rfc4180_field_separator=, exec='SPARQL SELECT ?s ?o FROM <some-graph> WHERE { ?s rdfs:label ?o }'

Produces the same %FFC3%FFB6.
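
One pattern that would explain the values seen in this thread: "ö" is C3 B6 in UTF-8 and F6 in Latin-1, and the output shows exactly those bytes with an extra "FF" in front of each. The Python sketch below reproduces that pattern under the assumption that every byte at or above 0x80 is treated as a negative signed value and widened to 16 bits before being printed as hex. This is a hypothesis drawn from the observed output only, not something confirmed against the isql source, and `sign_extended_escape` is a hypothetical helper name. It deliberately ignores the escaping of ";" and "%".

```python
def sign_extended_escape(data: bytes) -> str:
    """Escape non-ASCII bytes the way the observed output looks (hypothesis)."""
    out = []
    for b in data:
        if b < 0x80:
            # ASCII bytes pass through unchanged in this sketch.
            out.append(chr(b))
        else:
            # Reinterpret the byte as a signed 8-bit value, widen it to
            # 16 bits, and print it as four hex digits.
            signed = b - 0x100
            out.append("%%%04X" % (signed & 0xFFFF))
    return "".join(out)


print(sign_extended_escape("ö".encode("utf-8")))    # %FFC3%FFB6
print(sign_extended_escape("ö".encode("latin-1")))  # %FFF6
```

If the hypothesis holds, the difference between %FFF6 and %FFC3%FFB6 would come from whether the value reaches the CSV printer as Latin-1 or as UTF-8 bytes.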

HughWilliams commented 4 years ago

We shall look into this encoding issue in "isql" ...

roelj commented 4 years ago

Thanks a lot!

Perhaps an easy way to avoid breaking anything is to introduce a no_encode option in binsrc/tests/isql.c and use field_print_normal in field_print_csv_rfc4180 when no_encode is set to on.

If you'd like I could prepare a patch that implements this.

HughWilliams commented 4 years ago

@roelj: patch contributions are always welcomed ...

roelj commented 4 years ago

Just letting you know that my initial idea does not seem to have any effect: https://github.com/roelj/virtuoso-opensource/commit/4908939dab82bde9bb4f60592d38a047f7220a2a. So I'm working on a second version of the patch.

mjeulin commented 2 months ago

Hi, I too am bringing up this very old issue again because today (in 2024) I'm facing the same problem with an isql SELECT on the command line:

/usr/local/virtuoso-opensource-7.2.9/bin/isql 1111 dba dba exec='CSV_RFC4180=ON exec='SPARQL SELECT (…long query!...) ;' > result.csv

CSV_RFC4180 gives a properly formatted CSV file but causes this kind of bad encoding:

Universit%FFC3%FFA9 de Bourgogne Development of electrochemical biosensor based on CNT%FFE2%FF80%FF93Fe3O4 nanocomposite …

This makes the output unusable. Without CSV_RFC4180, the encoding is correct, but the file is no longer in CSV format... Version used: Virtuoso 07.20.3217 from February 2017 (Open-Source edition).

Has this problem been fixed since then? If so, what is the correct way to proceed? Thank you.
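
Until the output itself is fixed, one possible workaround is to post-process the CSV file. The Python sketch below assumes that every "%FF" followed by two uppercase hex digits in the range 80 to FF stands for one original byte, and that those bytes are UTF-8, which matches the samples quoted in this thread. `repair_isql_escapes` is a hypothetical helper, not part of Virtuoso. Data that genuinely contains text such as "%FFC3" would be mangled by it, so treat it as a stopgap.

```python
import re

# "%FF" plus one byte value in 0x80..0xFF, as seen in the isql CSV output.
_ESCAPE = re.compile(rb"%FF([89A-F][0-9A-F])")


def repair_isql_escapes(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn each %FFxx escape back into the byte xx, then decode the result."""
    restored = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return restored.decode(encoding)


sample = b"Universit%FFC3%FFA9 de Bourgogne"
print(repair_isql_escapes(sample))  # Université de Bourgogne
```

For output where a single escape stands for a Latin-1 character (such as the "%FFF6" reported at the top of this issue), pass encoding="latin-1" instead. To repair a whole file, read it in binary mode and pass its contents to the function.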

TallTed commented 2 months ago

@mjeulin — I am not immediately aware of any change specifically related to your reported issue in the seven years since shipment of the version(s?) you're running. Nonetheless, I would strongly advise updating to a current build, because there have been hundreds of code changes in that time, including various bug fixes, performance boosts, and feature enhancements, from all of which all users will benefit.

I note that you reported using Virtuoso 07.20.3217 from February 2017 (which was branded as virtuoso-opensource-7.2.4.2), but your installation directory shows the much younger (though still rather old, as software ages!) virtuoso-opensource-7.2.9, which dates from February 2023. I wonder whether you have a mix of components from these (and possibly other) distributions; such mixes are untested and could therefore lead to any number of odd experiences.

If the issue described here persists in your testing with current components, you can help us deliver a resolution by providing step-by-step instruction for our own local reproduction. It probably makes sense to make such an updated report in a fresh issue, to avoid any confusion with details from the other deployments discussed here in #421.

mjeulin commented 1 month ago

Thank you for your answer. There is indeed a dual installation on this server, so my requests may actually only be calling version 7.2.4.2... The issue may be resolved in the near future with a cleaner re-installation. If not, I will get back to you (in a new issue).

joernhees commented 1 week ago

@TallTed maybe close this if fixed or as won't fix then? (should be easy enough to check / reproduce on a current version for you?)

looking at this and knowing the various encoding screw-ups that happen in the many systems with the "ö" in my name, i guess the csv is returning urlencoded unicode or utf-8 codepoints 🤷‍♂️ ?

TallTed commented 1 week ago

@joernhees — Something like this issue persists. The % comes out as desired, but the ö becomes the very strange %FFF6.

I've put a fresh install of VOS 7.2.13 (latest as of today) (that is, Version 07.20.3240-pthreads for Mac OS 11 (Intel x86_64) as of Jun 10 2024 (a1fd8195b)) on macOS 10.14.6 (18G9323).

Using a simplified query (so the output is easier to parse at a glance) shows your issue with the CSV output on the SPARQL query, but I think my second query shows the issue lies deeper:

SQL> SET CSV_RFC4180 ON;
SQL> sparql SELECT ("hallo you; % jörn" AS ?foo) WHERE {?s ?p ?o} LIMIT 1;
"foo"
"hallo you; % j%FFF6rn"

1 Rows. -- 2 msec.
SQL> SET CSV_RFC4180 OFF;
SQL> sparql SELECT ("hallo you; % jörn" AS ?foo) WHERE {?s ?p ?o} LIMIT 1;
foo
LONG VARCHAR
_______________________________________________________________________________

hallo you; % j?rn

1 Rows. -- 3 msec.
SQL> quit