w3c / sparql-results-csv-tsv

https://w3c.github.io/sparql-results-csv-tsv/

TSV: state how to handle special characters in strings #10

Open Tpt opened 1 year ago

Tpt commented 1 year ago

The specification does not explicitly state how quotes and ASCII control characters (\0...) should be escaped. It might be nice to add some sentences about it.

A note to state that the " quote should be preferred to the ' quote might also be nice to get some kind of "canonical" TSV serialization.

TallTed commented 1 year ago

A note to state that the " quote should be preferred to the ' quote might also be nice to get some kind of "canonical" TSV serialization.

This preference is commonly dictated by the data. If my data has lots of " characters and few or no ', I'd prefer to use the ' to quote each field, minimizing the need for inline escapes.

Putting some guidance like yours into a distinct "notes on canonicalization" section would probably be OK.
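
As an illustration of the data-driven preference described above, here is a minimal Python sketch that picks whichever quote character needs fewer inline escapes. The function name `pick_quote` and the tie-breaking rule (fall back to ") are assumptions made for this example, not anything the spec defines.

```python
def pick_quote(lexical: str) -> str:
    """Return the quote character that needs fewer inline escapes.

    Hypothetical helper: ties go to the double quote, which is only an
    assumed default for this sketch.
    """
    double_quotes = lexical.count('"')
    single_quotes = lexical.count("'")
    return "'" if double_quotes > single_quotes else '"'


print(pick_quote('she said "hi" and "bye"'))  # many " characters, so ' is chosen
print(pick_quote("it's"))                     # contains ', so " is chosen
```

A canonicalization note could instead fix a single quote character and accept the extra escapes; the sketch only shows the alternative being argued for here.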

domel commented 1 year ago

I think that it is a bad idea to change / overwrite the basic spec that is used. TSV has no official spec, but CSV has. And in that spec there is no information about '. It recommends to use double quotes (or nothing).

afs commented 1 year ago

TSV does have an official spec! https://www.iana.org/assignments/media-types/text/tab-separated-values

afs commented 1 year ago

In TSV, the quotes and escapes are from the RDF term writing.

https://w3c.github.io/sparql-results-csv-tsv/spec/index.html#tsv-terms "by using the syntax that SPARQL and Turtle use."

From what I see on the web, in Turtle, " is more common.

Either quote character needs its escaping checked (' in names seems to catch data-writing systems out).

Some advice-text would be useful - less than formal, single-choice canonicalization.

domel commented 1 year ago

TSV does have an official spec! https://www.iana.org/assignments/media-types/text/tab-separated-values

Yes and no. It's documentation for a media type rather than an official spec (that is, an RFC or STD). Regardless of the naming, there is nothing about '.

afs commented 1 year ago

And in that spec there is no information about '. It recommends to use double quotes (or nothing).

This issue (#10) is specific to TSV. For CSV, we should, of course, use ".

afs commented 8 months ago

This "needs discussion" issue was discussed during the telecon of 2023-11-30.

From the issue thread above, are we agreed that:

  1. TSV does not make any special case of " or ' because it is separation by a raw TAB.
  2. Turtle serializers more commonly use ".
  3. The current spec text covers quoting and control characters "by using the syntax that SPARQL and Turtle use." (section 5.1). Hence, no raw TABs in RDF term text.
  4. The text would benefit from expanding, such as having inline examples.

Anything else?
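
To make points 1 and 3 concrete, here is a small Python sketch of the reading side: if terms never contain a raw TAB, a row can be split on TAB alone and quote characters need no special treatment at the TSV level. The function name `split_tsv_row` and the example terms are assumptions for illustration.

```python
def split_tsv_row(line: str) -> list:
    """Split one TSV result row into its serialized RDF terms."""
    # Drop only the line terminator, then split on every raw TAB.
    # Quotes are not inspected: they belong to the term syntax, not to TSV.
    return line.rstrip("\r\n").split("\t")


# Example terms in SPARQL/Turtle syntax. The raw strings hold a backslash
# followed by 't', i.e. an escaped tab, not a raw TAB character.
fields = ["<http://example.org/s>", r'"a\tb \"quoted\""', r"'it\'s'"]
line = "\t".join(fields) + "\n"

assert split_tsv_row(line) == fields
```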

afs commented 8 months ago

Related to handling characters: the TSV Media Type does not specify the character set. Nowadays, the "default" for "text/" is UTF-8, a change from the original ASCII.

We can mention this and suggest ("SHOULD") that a missing character set is treated as UTF-8.
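
A minimal Python sketch of what that suggestion would mean for a consumer, assuming the charset (if any) has already been extracted from the Content-Type header. The function name `decode_tsv` and its parameters are hypothetical.

```python
from typing import Optional


def decode_tsv(body: bytes, charset: Optional[str] = None) -> str:
    """Decode a TSV response body, treating a missing charset as UTF-8."""
    # Assumption for this sketch: an explicitly declared charset is honoured,
    # and only its absence falls back to UTF-8.
    return body.decode(charset or "utf-8")


text = decode_tsv("?name\n\"Zo\u00eb\"\n".encode("utf-8"))
assert text.splitlines()[1] == '"Zo\u00eb"'
```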

kasei commented 8 months ago

The current spec text covers quoting and control characters "by using the syntax that SPARQL and Turtle use." (section 5.1). Hence, no raw TABs in RDF term text.

I think this one could use just a bit of nuance. There's no need for raw TABs in RDF term text, but SPARQL and Turtle do allow raw tabs in their literal syntax. The SPARQL TSV spec already has language about this, though:

A TSV format SPARQL result set must use the single quoted literal forms, together with any necessary escapes such as \t, \n and \r.

That seems clear enough to me.
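
For the kind of inline example being discussed, here is a hedged Python sketch of a writer for the lexical form of a literal. It escapes the backslash, the chosen quote character, and TAB, LF and CR as \t, \n and \r. Writing the remaining C0 control characters (such as \0) as \uXXXX is an assumption of this sketch: Turtle's UCHAR syntax permits it, but the thread does not settle it. The name `tsv_literal` is hypothetical, and language tags and datatypes are left out.

```python
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def tsv_literal(lexical: str, quote: str = '"') -> str:
    """Serialize a literal's lexical form in a short quoted form for TSV."""
    out = []
    for ch in lexical:
        if ch == quote:
            out.append("\\" + ch)              # escape the delimiter itself
        elif ch in _ESCAPES:
            out.append(_ESCAPES[ch])           # backslash, TAB, LF, CR
        elif ord(ch) < 0x20:
            out.append("\\u%04X" % ord(ch))    # assumed handling of other controls
        else:
            out.append(ch)
    return quote + "".join(out) + quote


print(tsv_literal("a\tb"))                # the tab is written as backslash-t
print(tsv_literal("it's", quote="'"))     # the apostrophe is backslash-escaped
```

With this approach the output never contains a raw TAB or newline, so the row and column structure of the TSV file stays intact.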

Agree that inline examples would be an improvement.