CSVW dialect inconsistencies and issues with commentPrefix

RickMoynihan commented 2 years ago

Hi,

The CSVW Metadata Vocabulary for Tabular Data specifies this for commentPrefix:

An atomic property that sets the comment prefix flag to the single provided value, which MUST be a string. The default is "#".

Yet in a non-normative section in the Tabular Data Model this is contradicted where it states this about the commentPrefix:

A string that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as a rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
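To make the quoted behaviour concrete, here is a minimal sketch of how a comment prefix might be applied to lines that have already been split. It is a simplification, not the spec's parsing algorithm: it ignores quoting, header rows and skipped rows, and the function name `split_comments` is made up for illustration. It shows why a non-null prefix can make source row numbers differ from row numbers.

```python
from typing import Optional


def split_comments(
    lines: list[str], comment_prefix: Optional[str]
) -> tuple[list[tuple[int, str]], list[str]]:
    """Separate comment lines from data lines.

    Returns (rows, comments); each row is (source_line_number, text).
    A comment_prefix of None means no line is treated as a comment.
    """
    rows: list[tuple[int, str]] = []
    comments: list[str] = []
    for source_number, line in enumerate(lines, start=1):
        if comment_prefix is not None and line.startswith(comment_prefix):
            # Keep the comment text without the prefix itself.
            comments.append(line[len(comment_prefix):].strip())
        else:
            rows.append((source_number, line))
    return rows, comments


sample = ["# exported 2021", "id,name", "1,Ada"]
print(split_comments(sample, "#"))   # first data row has source number 2
print(split_comments(sample, None))  # every line is data, no comments
```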

My understanding therefore is that strictly speaking the correct default according to the standard is # and the value of null would be illegal; because null is not a String.

However this default (and the requirement for it to always be a String) is I think highly problematic for implementers, because it essentially means RFC4180 in UTF-8 is NOT a valid CSVW subset, as you also MUST support comments, because comments are in the default dialect!

Worse, the requirement that it MUST be a String means that users cannot override and declare a dialect without comments, and therefore they can't restrict themselves to a practical "best-practice" standards-compliant subset.

Unless I'm missing something, it seems that null would be a much better default; and I wonder whether null is actually what was intended, and the guidance is correct, but the standard wrong.

Some clarification around this point would be highly appreciated 🙇

gkellogg commented 2 years ago

> Hi,
>
> The CSVW Metadata Vocabulary for Tabular Data specifies this for commentPrefix:
>
> An atomic property that sets the comment prefix flag to the single provided value, which MUST be a string. The default is "#".
>
> Yet in a non-normative section in the Tabular Data Model this is contradicted where it states this about the commentPrefix:
>
> A string that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as a rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
>
> My understanding therefore is that strictly speaking the correct default according to the standard is # and the value of null would be illegal; because null is not a String.

I think these two statements are consistent. The metadata document has little to say about commentPrefix, other than to reference it in the data model. If it appears, it must be a string. In the data model, the comment prefix is part of the dialect description, and as such is logically always present; if not set, it is null. In a metadata document, you can't really set a property to null, as a JSON-LD parser will simply drop such values, and the metadata document prohibits setting it to null.

> However this default (and the requirement for it to always be a String) is I think highly problematic for implementers, because it essentially means RFC4180 in UTF-8 is NOT a valid CSVW subset, as you also MUST support comments, because comments are in the default dialect!

It can't be set to null, but has that default value. Note that the dialect description is not either RDF or JSON-LD, but an artifact of the data model with specific (but non-normative) text for how to treat it if it is null (it is ignored).

> Worse, the requirement that it MUST be a String means that users cannot override and declare a dialect without comments, and therefore they can't restrict themselves to a practical "best-practice" standards-compliant subset.

The text in the data model document is non-normative because, generally, processing from any specific format (including CSV) is non-normative and expected to be defined elsewhere. This leaves room for other formats to be described. Some other format could either require that the value for comment prefix must be null, or that it is ignored.

> Unless I'm missing something, it seems that null would be a much better default; and I wonder whether null is actually what was intended, and the guidance is correct, but the standard wrong.

Now I'm unclear on what you're asking. The data model document does say the default is null (for CSV), and the metadata document says that, if set explicitly, it must be a string. It doesn't allow you to set it to null explicitly, and neither does the JSON-LD format; or rather, setting anything to null is equivalent to not setting it at all (just as in RDF).
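A minimal sketch of the resolution behaviour described in this reply, assuming the dialect description has already been loaded as a plain dictionary. The function name `resolve_comment_prefix` is hypothetical, and raising on a non-string value is a choice made for this sketch; a real processor may prefer to warn and fall back to the default. An explicit JSON `null` is treated exactly like an absent property, so the outcome depends only on which default the processor assumes.

```python
import json
from typing import Any, Optional


def resolve_comment_prefix(
    dialect: dict[str, Any], default: Optional[str]
) -> Optional[str]:
    """Return the effective comment prefix for a dialect description."""
    value = dialect.get("commentPrefix")
    if value is None:
        # Absent and explicit null are indistinguishable here, mirroring
        # the point that setting a property to null equals not setting it.
        return default
    if not isinstance(value, str):
        # Assumption for this sketch: reject rather than warn and continue.
        raise ValueError("commentPrefix must be a string when set explicitly")
    return value


explicit_null = json.loads('{"commentPrefix": null}')

# With the metadata document's stated default of "#":
print(resolve_comment_prefix({}, "#"))
print(resolve_comment_prefix(explicit_null, "#"))

# With the data model's stated default of null:
print(resolve_comment_prefix({}, None))
print(resolve_comment_prefix({"commentPrefix": ";"}, None))
```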

> Some clarification around this point would be highly appreciated 🙇

RickMoynihan commented 2 years ago

Hi @gkellogg I really appreciate your response.

Firstly forgive me for perhaps asking the wrong question; or framing too early the issue as a potential inconsistency in the spec.

As I read your response, either I'm still missing something, or I've not clearly communicated the wider issue, or you appear to be saying "strictly speaking there is no inconsistency/issue", because it's clear a processor should behave like this:

1. When interpreting a metadata document without a dialect description or a commentPrefix set, you must assume it provides a default commentPrefix string of "#".
2. The Parsing Tabular Data section, which says the default comment prefix is null, is a different thing: even though it takes its value from the metadata document, whatever it gives as its default is ignored, because it has been overridden by the value provided by the metadata document.

i.e. the metadata document's value has taken precedence over the default in the parsing section.

Assuming the latter interpretation, you appear to confirm there is no standards-compliant way to instruct the parser to use a commentPrefix of null.

If that's the case, then I agree that is precisely my understanding of what should happen, and I'm also fully aware of the comments you make about JSON-LD and null, and of null's closest equivalent in RDF being to not make a statement.

However I think this is missing the wider issue, which is that it appears to be fundamentally impossible in CSVW to describe a file which is just RFC4180.

This seems unfortunate because RFC4180 is surely the world's most popular CSV dialect which has a specification associated with it.

Why is this a problem?

We are working with partners across government who are keen to standardise on CSVW. Part of this will likely be putting CSVW files on the web, other parts are using CSVW's csv2rdf to assist in data transformation; but as we grow a bigger part is also about establishing and using CSVW as an interchange format in internal processes, ETL, narrowly defined tools etc.

Helping people implement bits of processes and adopt CSVW compliant tools is part of the process here, and one of the hopes for csvw.org is to eventually provide resources and recommendations to implementers and the people charged with building these processes etc to make the process easier.

However if this is to be successful, I feel we need to be able to define smaller application profiles of CSVW, which let people assume or ignore other bits of the spec.

So there's a community of people who will hopefully be writing CSVW tools and processes in R, Python and any other language they use.

I'd like to say to them, "Oh it's easy if you're not at the edges; just stick with our recommendations of UTF-8 and RFC4180 as a CSVW compliant subset; and use an off the shelf RFC4180 parser, and raise an error on any dialect you don't understand".

However, instead I have to say you need to do one of the following:

1. Support comment prefixes in your parser (not many do).
2. Deviate from the CSVW standard and allow null in the metadata file, and encourage the use of {"dialect": {"commentPrefix": null}} as a means of saying "pure RFC4180 in UTF-8", which strikes me as the most sensible default.
3. Accept that the CSV's metadata file says it "may contain comments" when it actually doesn't, and just ignore that the metadata file makes a declaration about the file's dialect that isn't true. It would be useful if a processor could have looked at the metadata document and known "I can't handle that".

It also seems odd that the default metadata dialect assumes a comment; and almost encourages people to step outside of RFC4180. I would have expected the dialect to be a declaration to the parser about how to parse the file; and be a means for a parser to say "I don't understand that particular dialect" and fail. However CSVW instead seems to require you to parse all possible dialects of CSV (expressible in CSVW); which is a much higher bar for implementers than the former.
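As a rough illustration of the kind of narrow tool I have in mind, here is a sketch of a processor that accepts only a "plain RFC4180 in UTF-8" profile and fails on anything else. The profile table, the `UnsupportedDialect` exception and the function names are all hypothetical, and accepting an explicit null commentPrefix is precisely the deviation from the metadata spec described above, not something the standard sanctions.

```python
import csv
import io
from typing import Any

# Hypothetical profile: the only dialect values this sketch accepts.
SUPPORTED = {
    "encoding": "utf-8",
    "delimiter": ",",
    "quoteChar": '"',
    "doubleQuote": True,
    "commentPrefix": None,  # assumption: null means "no comments"
}


class UnsupportedDialect(Exception):
    """Raised when a dialect description falls outside the profile."""


def check_dialect(dialect: dict[str, Any]) -> None:
    for key, value in dialect.items():
        if key not in SUPPORTED:
            raise UnsupportedDialect(f"unknown dialect property: {key}")
        if key == "encoding" and isinstance(value, str):
            value = value.lower()
        if value != SUPPORTED[key]:
            raise UnsupportedDialect(f"unsupported value for {key}: {value!r}")


def read_rows(text: str, dialect: dict[str, Any]) -> list[list[str]]:
    """Parse already-decoded CSV text with an off-the-shelf parser."""
    check_dialect(dialect)
    # newline="" lets the csv module handle CRLF line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


print(read_rows('id,name\r\n1,"Lovelace, Ada"\r\n', {"commentPrefix": None}))

try:
    read_rows("# a comment\r\nid,name\r\n", {"commentPrefix": "#"})
except UnsupportedDialect as err:
    print("refused:", err)
```

The point of the design is that the tool never has to parse comments; it only has to recognise a declaration it cannot honour and stop.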

I also do appreciate the other angle of CSVW about "being on the web", and therefore having to tolerate a wide variety of inputs.

I genuinely appreciate your comments, and trust that you're probably responding in your own time; and I apologise if this comes across as moaning. I really just want to know what the best course of action is for recommendations; and if we need to require people to parse comments etc so be it.

gkellogg commented 2 years ago

Sorry, @RickMoynihan, I responded too quickly. You're right that the default is '#' in the metadata document, which is inconsistent with the data model document, which sets it as null. I actually think that the metadata document is in error, and the default should be null there too, which I'll add as an erratum. My implementation, which is fully conformant, defaults it to null.

> However I think this is missing the wider issue, which is that it appears to be fundamentally impossible in CSVW to describe a file which is just RFC4180.

Given the default value of null, I don't understand why this is the case.

Would such an erratum satisfy your concern?

gkellogg commented 2 years ago

Actually, I see that this is already the subject of #801, which appears as an Erratum for the document. Given that, I believe we should close this issue as duplicating #801.

RickMoynihan commented 2 years ago

Great; that's what I was hoping: that the intended default is essentially pure RFC4180 with UTF-8, i.e. an easy recommendation.

Thanks again for clarifying and persevering with me. I'll try and remember to also check the errata in the future. 🙇

RickMoynihan commented 2 years ago

One last question, how and when are errata actually rolled into the spec documents? The errata page didn't seem to mention it.

gkellogg commented 2 years ago

At this point, not until there is a new Working Group chartered. There is some effort to create an RDF Maintenance group (coming out of the RDF-star effort) which might be able to do some updates, and there are probably other things that could be done for CSVW, but I don’t see too much support behind that just now.

w3c / csvw

CSVW dialect inconsistencies and issues with commentPrefix #881