w3c / csvw

Documents produced by the CSV on the Web Working Group

XSD schemas as datatypes #524

Closed iherman closed 9 years ago

iherman commented 9 years ago

Robert Baldy's comment on the mailing list: http://www.w3.org/mid/5540D6A7.3010104@gide.net. It may also be linked to #223, which was also triggered by @robald7.

The essence of the comment is, in my understanding:

then I wonder why it would not be possible to have something like "@type" pointing to a XSD schema element? It is always easy not to use a facility, much more difficult to add it afterwards.

(I presume this should refer to the datatype property rather than @type.)

robald7 commented 9 years ago

Thanks for the question. It seems to me that, as "datatype" is not comprehensive enough, "@type" could be used instead (as, I think, in JSON-LD) to allow more complicated types. But I agree, that is really the essence of the comment! Best wishes r,

This message and the information contained within it is intended for the recipient alone and any unintentional recipient should not act upon the information apart from notifying the sender that the message has been inadvertently diverted. The unintended recipient should delete the message and inform the sender of the error. Please consider the environment before printing this email.

JeniT commented 9 years ago

@robald7 Can you confirm which of the following scenarios is what you're envisioning:

  1. Do processors see the XML schema datatype reference, retrieve an XML schema document, parse and process that document to extract the XML schema datatype definition, and then apply the datatype definition to the values in the CSV file?
  2. Do processors see the XML schema datatype reference and ignore it for validation purposes but use the name of the datatype to label values in any generated (eg RDF/XML) data?

robald7 commented 9 years ago

Good morning

I am not quite sure I understand the choice, so let me tell you what I would do now without any new special tool. This would be on a server, not in a browser: in my experience so far there has never been a reason to send CSV to a browser. Usually data produced by an organisation is made available as CSV for other people to use to produce sites, do statistical analyses and so on, and it is when receiving or giving data that a standard is much needed.

1) a) If the datatypes are all XSD, I would simply transform the CSV into basic XML and validate against the schema.
   b) If the datatypes are a mix of the suggested way to describe types and XSD schema, I would convert the "json" types to XSD and go back to a).
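Step a) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `csv_to_basic_xml`, the element names `table` and `row`, and the sample data are all hypothetical, and the column names are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_basic_xml(csv_text, root_tag="table", row_tag="row"):
    """Turn each CSV row into a row element with one child element per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element(root_tag)
    for record in reader:
        row = ET.SubElement(root, row_tag)
        for column, value in record.items():
            # Assumption: every column name is usable as an XML element name.
            ET.SubElement(row, column).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, not taken from any real dataset.
sample = "name,age\nAlice,34\nBob,130\n"
xml_text = csv_to_basic_xml(sample)
print(xml_text)
```

The resulting document could then be checked against a schema with an external tool, for example `xmllint --noout --schema schema.xsd data.xml`, where both file names are placeholders.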

On this site, which has not been updated to take into account the latest developments, you can see what I have in mind:

http://infotap.sda-ltd.com/dfe-like.html

2) Of course, one can always ignore validation, basically making each piece of data a string. For my own use, I would always prefer to have the types of the variables available in a formal way (maybe more than one). Some people don't care, as they think that it is up to the application to validate the data, which leaves open the question of how you describe the data!

I had a quick look yesterday at the ODI's csvlint. Looking at some schemas, I saw some mentions of XML Schema, and also a "corrected" dataset having all data as strings. Is the csvlint project a test-bed for W3C schemas?

Best wishes r

iherman commented 9 years ago

Marking this as resolved, by virtue of the (new and, by now, editorial) issue #543. @robald7, do you agree that this issue can be closed as soon as #543 is?

robald7 commented 9 years ago

Good morning

Given

"datatype": { "@id": "http://example.org/datatypes/age", "base": "integer", "minimum": 0, "maximum": 120 }

The identifier http://example.org/datatypes/age would be irrelevant for validation purposes, but on conversion to RDF (or potentially XML), the identified datatype could be associated with the value. It could also offer a location for further information or definition of the datatype.
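To make the quoted text concrete, here is a minimal Python sketch of a validator that checks a cell against the `base`, `minimum` and `maximum` of the datatype above and never looks at `@id`. The function name `validate_cell` and the error wording are hypothetical, and only the `integer` base is handled.

```python
import re

# Datatype description copied from the comment above.
AGE = {
    "@id": "http://example.org/datatypes/age",
    "base": "integer",
    "minimum": 0,
    "maximum": 120,
}


def validate_cell(value, datatype):
    """Return a list of error messages; an empty list means the value passed."""
    # "@id" is deliberately never read here: it only names the datatype.
    if datatype.get("base") != "integer":
        return []  # this sketch only knows how to check integers
    if not re.fullmatch(r"[+-]?[0-9]+", value):
        return [f"{value!r} is not an integer"]
    number = int(value)
    errors = []
    if "minimum" in datatype and number < datatype["minimum"]:
        errors.append(f"{number} is below minimum {datatype['minimum']}")
    if "maximum" in datatype and number > datatype["maximum"]:
        errors.append(f"{number} is above maximum {datatype['maximum']}")
    return errors


for cell in ["42", "130", "abc"]:
    print(cell, validate_cell(cell, AGE))
```

A datatype that carries only an `@id` would pass every value through this sketch, which is one way to read "irrelevant for validation purposes".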

I have some questions

-1) Why would/should it be irrelevant for validation purpose?
-2) if I was going to write simply "datatype" : { "@id" : "http://example.org/datatypes/age" } would it be acceptable?
-3) since you write about other typing systems, would it not be needed to say something about the type of the schema?

I can see that 1) and 2) are in some ways related.

While being a strong partisan of the Semantic Web and understanding your big part in making it happen, I don't think that at this moment this is the most important! We want good well described data for whatever purpose.

Best wishes r,

iherman commented 9 years ago

Hi @robald7

The final editing is still to be done, but here are my personal reactions to your questions:

-1) Why would/should it be irrelevant for validation purpose?

I do not think the term "irrelevant" should be used in the spec text. But what it means is that the validators may ignore this value, because validators are not required to do datatype validation on other datatypes than the ones listed in this spec.

-2) if I was going to write simply "datatype" : { "@id" : "http://example.org/datatypes/age" } would it be acceptable?

I believe so.

-3) since you write about other typing systems, would it not be needed to say something about the type of the schema? I can see that 1) and 2) are in some ways related.

Only the URI is used in the generated RDF (as an identification of the datatype), meaning that the only way to identify the datatype (which is not necessarily a schema, it can be an OWL term in one of many different possible OWL serializations!) is the media type.

robald7 commented 9 years ago

Good evening. In short, and if I understand well, a validator will be obliged to validate types defined in the spec, and if there is an "@id", it may (or may not) use it to validate. Understood like that, it suits me fine (and I am not too worried about what to do when two possible validations are present). I shall try this approach on "real" data and will let you know the problems I may have. Many thanks r (for info, I have not been able to access the W3C site today)

6a6d74 commented 9 years ago

-2) if I was going to write simply "datatype" : { "@id" : "http://example.org/datatypes/age" } would it be acceptable?

As @iherman says, the datatype would appear in the RDF output, but not in the (plain-old) JSON.

To assist the validation, it might also be useful to express something like the following for your example:

"datatype" : { "base": "nonNegativeInteger", "@id" : "http://example.org/datatypes/age" }

In this case, the validation would (tbc?) operate against the base type, whilst the output would include the datatype specified by @id.

Thoughts?
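The split suggested in this comment (validate against `base`, label the output with `@id`) could look roughly like the Python sketch below. It is an illustration only: `literal_for` is a hypothetical helper, the lexical check for `nonNegativeInteger` is simplified, and the output is written as an N-Triples style typed literal.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Datatype description from the comment above.
AGE = {"base": "nonNegativeInteger", "@id": "http://example.org/datatypes/age"}


def literal_for(value, datatype):
    """Validate against the base type, then label the literal with @id if given."""
    base = datatype.get("base", "string")
    # Simplified check: only nonNegativeInteger is validated in this sketch.
    if base == "nonNegativeInteger" and not re.fullmatch(r"\+?[0-9]+", value):
        raise ValueError(f"{value!r} is not a valid {base}")
    # The output datatype is the @id when present, otherwise the XSD base type.
    type_iri = datatype.get("@id", XSD + base)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{type_iri}>'


print(literal_for("42", AGE))
```

With the example datatype, the value is checked as a non-negative integer but typed with the `http://example.org/datatypes/age` IRI in the output.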

robald7 commented 9 years ago

Hi. Sorry to repeat myself, but I think that the most important thing is to be able to validate data as completely as possible if one wants to! RDF or JSON are, at least to me, peripheral issues at this moment :-) So in my "rewritten" example, I would use a validator checking against the "@id" type and, if that is not available, against the "base", but then accept that some data may not be valid against the "base" and at the same time be completely valid according to the specs for this data (ie clearly intended). Best wishes r,

robald7 commented 9 years ago

Good morning

As I am not quite sure what is the current position on datatypes, I have built a small example which in some ways exhibits my questions. I have been using XML and XSD.

It is clearly easy to go from the type for U to the proposed W3C schema for CSV, but not so for V, unless using a reference to a schema, if no union of datatypes is possible in the W3C schema. While the type for U with pattern is acceptable, it does not reflect well the "true" nature of the data ("decimal") and does not allow checking for minimum and maximum values. What do you think should be the solution?

I have started (not full-time) to look at the CSV data examples on the site. So far I would say much of the data is mostly of type "string", sometimes by error: "£1.5", ok for display, but needs removing the "£" for doing anything useful with it; also a lot of the data are really values in code-lists. I understand that these could be described as foreign keys or strings with patterns, but I think that they should be promoted to a type definition.

Best wishes r,

iherman commented 9 years ago

@robald7,

As I am not quite sure what is the current position on datatypes,

The current situation is:

  1. The datatype annotation/property will include an additional property called (I believe) extension. This property is a URL and would refer to a datatype definition in some datatype specification system (XML Schema, OWL 2 datatype restriction, etc). Metadata validators MAY validate the cell value against that datatype, and the RDF mapping will use a typed literal with that specific URL as a type URL.
  2. The question whether we would allow a union of datatypes is still pending. Yours was the only example that led to that possibility; there was no convincing argument coming from other sources to go down that route. The impression we got from your earlier comments suggested that your use cases are solved using the extension property.

"£1.5", ok for display, but needs removing the "£" for doing anything useful with it [...] I think that they should be promoted to a type definition

I would not like this Working Group going down the route of defining additional datatypes (beyond the ones defined for XSD). It would go way beyond its current mandate. This should probably be picked up by the XML Schema people extending the current XSD. At the moment, the extension property should cover this feature, with validators using their own datatype definitions.

Cheers

I.

robald7 commented 9 years ago

Thanks for the reply

1) If "extension" allows me to specify an entry in a schema against which I can validate data in a cell, I am happy.
2) about "£1.5", I want nothing to be done! I was just saying it is a bad example of coding.
3) The data I looked at so far on the W3C CSV site use mostly "string", so there is indeed no need for a union of data types, but in one numeric example there is a "-999" value; clearly this needs something to be done. And about the fact that there are few examples showing the need, it could also be that a lot of examples are about aggregated data and do not describe individual data, where there is nearly always missing data of one sort or another.
4) I agree completely with not wanting to add additional datatypes.
5) A lot of data in the W3C CSV examples are basically data coming from code-lists. Maybe this is something which needs some consideration besides foreign keys or string/pattern?

In short, "extension" will solve most of my problems. To me the idea of a standard about data is basically to define a contract between producer and consumer allowing for automatic validation. This is much needed: while I have always been able to deal with most textual data (exhibiting some regularity) so far by using the usual Unix tools (tr, awk/perl, head/tail, ...), it is not the best way forward. Hopefully the schema would replace all this "manual" data cleaning before processing.

Best wishes r,

iherman commented 9 years ago

Thanks for the reply 1) If "extension" allows me to specify an entry in a schema against which I can validate data in a cell,I am happy.

:-)

To be very precise, though: a validator is not REQUIRED to perform validation against, say, an OWL 2 Datatype defined that way, in the sense that there may be perfectly conformant validators that do not understand that datatype. But more advanced validators may do it and they have the standard hooks to do so.

2) about "£1.5", I want nothing to be done! I was just saying it is a bad example of coding.

Ah, o.k. I misunderstood.

Cheers

Ivan

robald7 commented 9 years ago

Thanks for the clarifications. While probably not possible, and in keeping with the idea of "not required", it would be nice to have a hierarchical chain of validations, for instance defining a datatype as xsd (or other) > current w3c-schema > string, the last one being the default and difficult to fault.

This is probably the way I should build datasets: rather than having 2/3 different schemas, one would do.
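The chain described here can be sketched as a list of validators ordered from most to least specific, where the first one that is actually available decides. Everything in this Python sketch is hypothetical: the level names, the `validate_with_chain` helper and the age check are illustrations, not anything defined by the Working Group.

```python
import re


def base_age_check(value):
    """Stand-in for a check against the datatypes built into the CSV schema."""
    return re.fullmatch(r"[0-9]+", value) is not None and int(value) <= 120


def validate_with_chain(value, chain):
    """Use the first available validator in the chain; fall back to string."""
    for level, check in chain:
        if check is not None:  # None marks a validator this tool does not have
            return level, check(value)
    return "string", True  # a plain string is difficult to fault


# Most specific first: external (e.g. XSD) > built-in datatypes > string.
chain = [
    ("external", None),  # this hypothetical tool cannot resolve external types
    ("base", base_age_check),
    ("string", lambda value: True),
]

print(validate_with_chain("42", chain))
print(validate_with_chain("abc", chain))
```

A tool that can resolve the external definition would put a real check in the first slot and never reach the later levels.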

best wishes r

robald7 commented 9 years ago

Good morning

I have looked at most of the examples in "CSV Data on the Web: Use Cases and Requirements". I am keeping the more complicated ones for later! But from what I have seen, the main problem is what is called in the document "AssociationOfCodeValuesWithExternalDefinitions" (what is usually called "controlled lists" or "codelists"). Is the only way to handle them to use a foreign key, or a string with patterns?

Having looked a little more at Case 11, trees in Palo Alto, it also seems to me that the union of data types is needed, as for instance:

- variable "Species" is a "natural" codelist plus some other values like "OBSOLETE SITE", "Vacant site (small tree)" which are not "species"
- "Trim cycle" has integers and strings (codelist)
- "Height code" has integers and strings (codelist, ie things like 35-40)

So the attribute "extension" to describe a datatype has a good future, if it allows the use of XSD datatypes!

Best wishes r,

JeniT commented 9 years ago

@robald7 I'd like to close this issue as I think that the content of the original issue has been addressed (we now have the @id property which can point at an external definition of a datatype, and which could feasibly identify a datatype defined through XML Schema). Do you agree that we can do this?

robald7 commented 9 years ago

Thanks. Can you point me to an example showing this property in use? I had a quick look at the editors' versions and could not find one. Best wishes r,

gkellogg commented 9 years ago

Test242 (http://w3c.github.io/csvw/tests/#manifest-rdf#test242) illustrates the use of this; there is no specific example, and I'm not sure that's warranted given the relatively narrow use of this feature (IMO), but if you have a specific suggestion on how an existing example can be simply expanded to include this we could consider that.

The test suite does serve as a living example for various corner cases not covered in the main documents.

robald7 commented 9 years ago

Thanks for directing me to this test suite, I shall look at it.

And while I don't want to appear as a fanatical supporter of XSD schema, I think that the most important part that the standardisation of CSV covers is the description of data, especially the types. Having a good description of the data will encourage people to use it; if the description is lacking, for instance ending up with string as the most used data type, it will not be seen as real progress.

In my work, I am often dealing with survey data. Taking for instance "age", we can say that it is an integer between 0 and 120. But there will always be things like "Don't know", "Refuse to say", ... Clearly there are at least 3 ways to look at the type:

1) integer (0,120) + codes (DNK,REF,..) + "", this last one being missing for whatever reason
2) integer (0,120) + "", all non-integer values are replaced by ""
3) string

1) carries more information than 2), but if it is not implemented, then I would prefer to use 3), as the validation will not report errors; I shall then do the checking later on in the process.
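Option 1) amounts to a union of three value spaces. A minimal Python sketch of such a check follows; the function name `classify_age` and the returned labels are hypothetical, and the code list contains only the two codes named above.

```python
import re

# Codes named in the comment above; a real codebook would list more.
MISSING_CODES = {"DNK", "REF"}


def classify_age(value):
    """Say which branch of the union a raw cell value falls into."""
    if value == "":
        return "missing"
    if value in MISSING_CODES:
        return "code"
    if re.fullmatch(r"[0-9]+", value) and int(value) <= 120:
        return "integer"
    return "invalid"


for cell in ["34", "DNK", "", "130", "thirty"]:
    print(repr(cell), classify_age(cell))
```

Option 2) would map the "code" branch to "" before validation, and option 3) would accept every value, leaving this kind of check to a later step.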

As a real example, why not use the variable "species" for the Palo Alto trees example, where there are clearly "species", "unknown" (it was not possible to say what species the tree belongs to) and some coding mistakes. When validating the data, I would want the mistakes to be discovered, but "unknown" should go through, and at the same time I would not want to add "unknown" to a list of species which could be used somewhere else completely unrelated.

Other examples could be found on

http://www.education.gov.uk/schools/performance/download_data.html

In addition to integer, float and percentage values and codes, we have things like "SUPP", "NE", "NA", "NP" which have definite meanings in the context.

This email is a bit of a "repeat", and as IH has pointed out earlier, since it will be possible to use the facility to refer to XSD schemas, it is ok with me, even if I think that in this case, when producing such data, I shall always use this XSD facility (ie not mix the 2 ways to declare the datatypes).

Best wishes r,

iherman commented 9 years ago

Decision at http://www.w3.org/2015/07/01-csvw-irc#T14-13-49

robald7 commented 9 years ago

https://github.com/w3c/csvw/tests/ seems to be not working

Also, I thought that the word "pattern" could be used in something like "datatype" : { "base" : "string", "pattern" : "(aaaa|bbbb|cccc)" }. I cannot see it in the two published documents. Has "format" replaced it?

best wishes r,

gkellogg commented 9 years ago

https://github.com/w3c/csvw/tests/ redirects to http://w3c.github.io/csvw/tests/; there must be some issue with the W3C setup, which @iherman can look into further. If you're running tests, you may want to clone the repo and run the files locally to avoid network delays using an appropriate shim. This is how the rdf-tabular gem runs the tests, and is much nicer for development.

also I thought that the word "pattern" could be used in something like "datatype" : { "base" : "string", "pattern" : "(aaaa|bbbb|cccc)" }. I cannot see it in the two published documents. Has "format" replaced it?

A datatype definition may include format; in the case of a numeric datatype, the format may be an object including pattern. This hasn't changed in some time.
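For the string case asked about, here is a minimal Python sketch of checking a value against a `format` regular expression. It assumes that `format` on a string datatype holds a regular expression that must match the whole value, and it uses Python's `re` syntax, which is not identical to the syntax the specification relies on; the helper name `matches_format` is hypothetical.

```python
import re

# The example from the question, with "format" in place of "pattern".
CODE = {"base": "string", "format": "(aaaa|bbbb|cccc)"}


def matches_format(value, datatype):
    """Return True when the value matches the datatype's format, if it has one."""
    pattern = datatype.get("format")
    if pattern is None:
        return True  # no format given, so nothing to check
    # Assumption: the expression has to match the entire value.
    return re.fullmatch(pattern, value) is not None


for cell in ["bbbb", "dddd", "aaaab"]:
    print(cell, matches_format(cell, CODE))
```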

robald7 commented 9 years ago

Thanks, I shall do that. And no problem with using "format" rather than "pattern" with strings. Best wishes r

robald7 commented 9 years ago

Good afternoon. I have been experimenting with the Palo Alto trees data set, creating metadata descriptions "automatically" from the CSV, using the "@id" feature only once (to describe the GrowSpace as a union), and validating the data (JSON description -> XML schema + XML data, then using xmllint to validate). It works as expected, but since I am mostly building the description from the data, this is the way it should be! A good test would be to have a codebook for this dataset; have you got one? Best wishes r
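The "JSON description -> XML schema" step mentioned above is not shown in the thread. A minimal Python sketch of one way it could work follows, assuming that `minimum` and `maximum` correspond to the XSD facets `minInclusive` and `maxInclusive`. The function name `datatype_to_simple_type` and the type name `age` are hypothetical, and only those two facets are handled.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Assumed mapping from datatype description keys to XSD facet names.
FACETS = {"minimum": "minInclusive", "maximum": "maxInclusive"}


def datatype_to_simple_type(name, datatype):
    """Build an xs:simpleType restriction from a JSON datatype description."""
    simple = ET.Element(f"{{{XS}}}simpleType", {"name": name})
    restriction = ET.SubElement(
        simple, f"{{{XS}}}restriction", {"base": "xs:" + datatype["base"]}
    )
    for key, facet in FACETS.items():
        if key in datatype:
            ET.SubElement(
                restriction, f"{{{XS}}}{facet}", {"value": str(datatype[key])}
            )
    return ET.tostring(simple, encoding="unicode")


xsd_fragment = datatype_to_simple_type(
    "age", {"base": "integer", "minimum": 0, "maximum": 120}
)
print(xsd_fragment)
```

The fragment would still need to be wrapped in an `xs:schema` element, together with element declarations for the rows, before a tool such as xmllint could use it.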