w3c / sparql-results-csv-tsv

https://w3c.github.io/sparql-results-csv-tsv/

Add serialization of quoted triples #24

Closed rubensworks closed 1 year ago

rubensworks commented 1 year ago

Closes #18


rubensworks commented 1 year ago

The table of content with "Example" then "Example with Quoted Triples" looks a bit weird to me. What about not inserting a new section but just adding a new example with quoted triples just below the existing one?

The main reason I decided to have a separate example section, was to be consistent with the XML and JSON result specs, where we decided that having separate sections is a better way forward. (I initially added it as a single merged example there as well)

Do we want to insert a space between << and the content i.e. write << <s> <p> <o> >> and not <<<s> <p> <o>>>? It might be more readable. I believe nothing has been decided on this topic for canonical NTriples, we should maybe get a decision there and follow the same approach (without blocking this PR).

Sure, I can look into that.

Tpt commented 1 year ago

The main reason I decided to have a separate example section, was to be consistent with the XML and JSON result specs, where we decided that having separate sections is a better way forward. (I initially added it as a single merged example there as well)

Makes sense. Thank you!

Do we want to insert a space between << and the content i.e. write << <s> <p> <o> >> and not <<<s> <p> <o>>>? It might be more readable. I believe nothing has been decided on this topic for canonical NTriples, we should maybe get a decision there and follow the same approach (without blocking this PR).

Sure, I can look into that.

Thank you!

afs commented 1 year ago

The first new section 1.2 with data 1.2 looks a bit odd. It could be just another table in one section.

Separate examples like JSON results later makes sense.

(This is not a blocker)

TallTed commented 1 year ago

  1. Do we want to insert a space between << and the content i.e. write << <s> <p> <o> >> and not <<<s> <p> <o>>>? It might be more readable. I believe nothing has been decided on this topic for canonical NTriples, we should maybe get a decision there and follow the same approach (without blocking this PR).

I strongly support including these (syntactically optional) space characters, as they make learning much easier, even if they may be dropped in eventual deployments.

Currently, we have a mix of things like <<_:b ... _:c >> (no extra space at the open, but extra space at the close) and <<_:b ... _:c>> (no extra space at either open or close), and (my preferred) << _:b ... _:c >> (extra spaces at both open and close), in various places (index.html, wiki pages, etc.) which inconsistency does not help comprehension.

gkellogg commented 1 year ago

N-Triples says the following:

White space MUST NOT be used except after subject, predicate, and object, any of which MUST be a single space (U+0020).

that argues against inserting white space.

kasei commented 1 year ago

N-Triples says the following:

White space MUST NOT be used except after subject, predicate, and object, any of which MUST be a single space (U+0020).

that argues against inserting white space.

I agree with @TallTed here that we should include spaces. I think the existing N-Triples text made sense because it ensured there is minimal (but present!) separation between s/p/o. But the introduction of quoted triples will complicate that. If there is nesting of quoted triples without whitespace, you'll end up with the chevrons grouping together <<<<<<s p o>> p o>> p o>> p o. That strikes me as much harder to read, especially if s or o are IRIs using their own < and >.
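
To make the readability concern concrete, here is a minimal Python sketch that renders the same nested quoted triple with and without padding spaces. The `render` helper and the tuple representation of quoted triples are hypothetical, chosen only for illustration; they are not part of any spec or of this PR.

```python
def render(term, pad):
    """Render a term; a 3-tuple is treated as a quoted triple (illustrative only)."""
    if isinstance(term, tuple):
        inner = " ".join(render(t, pad) for t in term)
        return f"<< {inner} >>" if pad else f"<<{inner}>>"
    return term


# Three levels of nesting, with IRIs that bring their own < and >.
nested = ((("<s>", "<p>", "<o>"), "<p>", "<o>"), "<p>", "<o>")

compact = render(nested, pad=False)
spaced = render(nested, pad=True)

print(compact)  # chevrons of the IRIs and the quoted triples run together
print(spaced)   # each << and >> stands apart from its content
```

In the compact form the opening run mixes quoted-triple chevrons with the `<` of the first IRI, which is the grouping problem described above.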

afs commented 1 year ago

For TSV, the field entries are Turtle syntax terms - section 4.2. Hence, white space can be optional and the example can use it for clarity for presentation.

For CSV, the format is defining the quoted triple presentation. It does not even have to be << >> although it should have some justification if not.

There is no assumption that CSV is parsed using an RDF system. It is an off-ramp for data. URIs do not get <> around them for simplicity of use, and there are no language tags and no datatypes. We have the TSV form when precision is required.

The CSV form can be read directly into a spreadsheet and provide something useful. For quoted triples, I doubt there is much use but we might as well provide something. So the design question is "what's simplest to show a quoted triple"?

We need three terms inside the << >> and when a literal is involved, it'll need ""-quoting. But what about , in literals or IRIs? Quotes within fields are not CSV-quotes. There will be outer "-quotes needed.

   name = field
   field = (escaped / non-escaped)
   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA
   TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

TEXTDATA excludes " (U+0022) and , (U+002C).

<< :s :p "hello, dave" >> is two CSV cells. It needs to be escaped as "<< :s :p ""hello, dave"" >>"
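
The two-cells point can be checked with any CSV parser. The following is a small sketch using Python's standard `csv` module as one example reader; the `parse_row` helper is hypothetical, and other CSV parsers may treat the stray quotes in the unescaped form differently.

```python
import csv
import io


def parse_row(line):
    """Parse a single CSV line into its list of cells (illustrative helper)."""
    return next(csv.reader(io.StringIO(line)))


unescaped = '<< :s :p "hello, dave" >>'
escaped = '"<< :s :p ""hello, dave"" >>"'

cells_unescaped = parse_row(unescaped)
cells_escaped = parse_row(escaped)

print(len(cells_unescaped))  # the comma inside the literal splits the row
print(cells_escaped)         # one cell, with the inner "" collapsed to "
```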

So proposal:

  1. Always CSV-escaped.
  2. Prefer "- for literals (and hence "" in CSV) over '-literals because " is more common.
  3. Within the CSV escaped field, << subject-term predicate-term object-term >> with single white space preferred after/before quoted triple quotes.

rubensworks commented 1 year ago

I've updated the PR based on all reviews and comments so far.

@afs I did not fully understand (the last part of) your last comment. Is a change in the spec required for this? It seems like escaping is already handled following RFC4180.

afs commented 1 year ago

@rubensworks

Concrete example: how does a literal with a space get encoded? << <x:x> <x:p> "abc def" >>

Until now, literals are written raw, maybe employing CSV-defined "" (not Turtle "")

CSV-" applies to the cell. It is the CSV parser that removes them and processes inner ""

But with quoted triples, the literal is within a longer string in the table cell. If it is surrounded by double quotes (CSV escaped), the CSV parser isn't doing anything with them.

So I don't see what the cell is for the Turtle << <x:x> <x:p> "abc def" >>.

rubensworks commented 1 year ago

Thanks for the clarification @afs.

As written now, << <x:x> <x:p> "abc def" >> would be represented as << x:x x:p abc def >>, which may be a bit too lossy indeed.

If I understand your proposal above correctly, you suggest to always represent literals in quoted triples using ". So in this case, we would get << x:x x:p "abc def" >>.

If our quoted triple would contain ,'s such as << <x:x,y> <x:p> "abc, def" >>, we could escape the whole quoted triple, so this would become "<< x:x,y x:p ""abc, def"" >>".
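
As a rough sketch of this interpretation (not the spec's wording), the cell text can be built first and then handed to a CSV writer for escaping. The helper names `quoted_triple_cell` and `csv_line` are hypothetical, and Python's `csv` module stands in for any RFC 4180-style writer.

```python
import csv
import io


def quoted_triple_cell(subject, predicate, literal):
    """Build the cell text: bare IRIs, object literal wrapped in " (assumed design)."""
    return f'<< {subject} {predicate} "{literal}" >>'


def csv_line(cells):
    """Serialize one row and return it without the trailing newline."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(cells)
    return buf.getvalue().rstrip("\n")


line = csv_line([quoted_triple_cell("x:x,y", "x:p", "abc, def")])
print(line)
```

Note that Python's writer, in its default minimal-quoting mode, also escapes cells that contain a `"` but no comma, so with this sketch every quoted triple that has a literal ends up CSV-escaped.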

If this interpretation seems correct, I will incorporate this within this PR.

afs commented 1 year ago

Not so much a proposal as working through the implications.

This is the CSV format - I don't believe quoted triples will be important. If they are, the TSV format is much better.

We need a legal CSV cell entry. Some quoting seems to be the best we can do for presentation if we are rejecting the bare literal form (it's an option for us; I'd prefer quoting).

One detail: if the literal itself includes ", we can either:

  1. escape Turtle style the " to get characters \"" in the string.
  2. Use CSV escapes recursively ending up with """" and it becomes "" in the results.

Either can be made to work, both require the writer to do something. I don't mind which.

An alternative is to use single quotes. Technically better (fewer nested quoting issues) but a little less user friendly because double quotes are in more common use generally.

rubensworks commented 1 year ago

  1. Use CSV escapes recursively ending up with """" and it becomes "" in the results.

This option is probably the best way forward, since it's consistent with all other escaping in CSV. (and it may even be implicitly assumed according to the current wording in the document, but would be good to make it explicit.)
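
A minimal sketch of what "recursively" means here, assuming the literal is CSV-escaped once on its own and the whole cell is CSV-escaped again. The `csv_escape` helper and the sample literal are hypothetical; the round trip uses Python's `csv` reader as an example parser.

```python
import csv
import io


def csv_escape(text):
    """Wrap text in double quotes, doubling any embedded double quote."""
    return '"' + text.replace('"', '""') + '"'


literal = 'say "hi"'
inner = csv_escape(literal)                    # literal level: " becomes ""
cell = csv_escape(f"<< x:x x:p {inner} >>")    # cell level: "" becomes """"

# A CSV reader undoes only the outer level of escaping.
parsed = next(csv.reader(io.StringIO(cell)))[0]

print(cell)
print(parsed)
```

The reader sees `""` around the embedded quote in the parsed cell, so a consumer that wants the original literal has to undo one more level itself.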

rubensworks commented 1 year ago

@afs I just pushed a new commit, which should resolve the issues you raised.