w3c / csvw

Documents produced by the CSV on the Web Working Group

Metadata merge order and extraction from CSV #144

Closed gkellogg closed 9 years ago

gkellogg commented 9 years ago

In working on my implementation, I've faced some seeming contradictions in assertions made in the Model, Metadata, RDF, and JSON documents.

TL;DR: consider changing the metadata processing order to place embedded metadata at a lower precedence than other metadata.

In the Model document Locating Metadata section, the processing order of metadata gives the following precedence:

  1. User-supplied metadata,
  2. Embedded metadata,
  3. Metadata from a Link header,
  4. File-specific metadata, and
  5. Directory-specific metadata

Setting aside the format of any user-supplied metadata, I take this to mean the following:

  1. Open the target file (either CSV or metadata JSON)
  2. If opening the file results in a Link header with rel=describedby, process the referenced file as a metadata document
  3. Otherwise, if "file-metadata.json" exists, where "file" is the opened file without an extension, attempt to open that and process it as a Metadata document
  4. Otherwise, if "metadata.json" exists relative to the original file, attempt to open that and process it as a Metadata document.
  5. Otherwise, create new empty Table metadata.
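The fallback chain in steps 2–4 can be sketched as follows. This is a non-normative sketch of my reading; the helper callables (`link_header`, `exists`) are hypothetical stand-ins for the HTTP machinery, not anything defined by the specs.

```python
import posixpath
from urllib.parse import urljoin

def locate_metadata(url, link_header, exists):
    """Return the URL of the metadata document describing `url`, or None.

    `link_header(url)` returns a rel=describedby target or None;
    `exists(url)` tests whether a candidate metadata file is retrievable.
    Both are hypothetical stand-ins for real HTTP requests.
    """
    # Step 2: a Link header on the opened file wins first.
    described_by = link_header(url)
    if described_by:
        return urljoin(url, described_by)
    # Step 3: file-specific metadata ("file-metadata.json").
    base, _ext = posixpath.splitext(url)
    if exists(base + "-metadata.json"):
        return base + "-metadata.json"
    # Step 4: directory-specific metadata ("metadata.json" next to the file).
    candidate = urljoin(url, "metadata.json")
    if exists(candidate):
        return candidate
    # Step 5: nothing found; the caller starts from empty Table metadata.
    return None
```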

Attempt to extract metadata from the file, resulting in something like the following:

Given:

countryCode,latitude,longitude,name
AD,42.546245,1.601554,Andorra
AE,23.424076,53.847818,"United Arab Emirates"
AF,33.93911,67.709953,Afghanistan

Processing the file using steps 1-5 above, and presuming that no other metadata is found, results in defaults for Dialect information. This would produce metadata similar to the following:

{
  "@context": "http://www.w3.org/ns/csvw",
  "@id": "https://example.org/countries.csv",
  "schema": {
    "columns": [{
      "name": "countryCode",
      "title": "countryCode",
      "predicateUrl": "countryCode"
    }, {
      "name": "latitude",
      "title": "latitude",
      "predicateUrl": "latitude"
    }, {
      "name": "longitude",
      "title": "longitude",
      "predicateUrl": "longitude"
    }, {
      "name": "name",
      "title": "name",
      "predicateUrl": "name"
    }],
    "urlTemplate": "_:"
  }
}
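Extracting that embedded metadata amounts to reading the header row and fabricating one column description per header. A minimal, non-normative sketch, assuming default Dialect settings; the property names simply mirror the example above and may well change:

```python
import csv
import io

def embedded_metadata(url, csv_text):
    """Build table metadata from the header row of a CSV, assuming
    default Dialect settings (comma-separated, one header row)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "@id": url,
        "schema": {
            "columns": [
                {"name": h, "title": h, "predicateUrl": h}
                for h in header
            ]
        },
    }
```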

This is then merged in with the (empty) metadata to create the metadata used to process the document. (Note that this being the case, the purpose of the Core Tabular Data processing model is not clear.)

If there were directory-specific metadata (such as in Example 22 in Foreign Key Reference Between Resources), the description of merging in csv2rdf seems to be at odds with my notion of what it means to merge documents together.

I think processing of metadata should probably be left to the metadata and model documents, so that it's not repeated needlessly. If we establish rules for extracting metadata from a CSV, along with clear precedence rules for Inherited Properties, there should be no need to repeat this in the csv2rdf and csv2json documents.

In any event, the merge semantics described here seem to be at odds with the merge order described above. For example, merging the embedded metadata into the directory metadata would replace the "name" and "urlTemplate" values with the extracted values. I realize that the wording says not to overwrite those values (or to order "title" content), but that seems like a reverse merge to me.

Perhaps the order should place embedded metadata at a lower precedence, so that we start with embedded metadata, merge in metadata found in steps 2-4 above, and then merge in user-provided metadata. The primary consideration would be merging the "title" property, which would need to ensure that a merged-in "title" has a non-empty intersection with any existing value, and that the results of the merge are ordered with the original value first, followed by other values (although this is significantly complicated by the use of language maps, which are inherently unordered).
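For the "title" case specifically, here is a sketch of the intersection-and-ordering rule I have in mind, under the simplifying assumption that titles are plain lists rather than language maps (which, as noted, are unordered):

```python
def merge_titles(existing, incoming):
    """Merge `incoming` titles into `existing`: require a non-empty
    intersection when both sides have values, keep the original
    values first, and append any new ones after them."""
    if existing and incoming and not set(existing) & set(incoming):
        raise ValueError("titles do not intersect; refusing to merge")
    return existing + [t for t in incoming if t not in existing]
```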

Processing is still complicated, as Dialect values probably need to be applied before the input is processed, which implies two passes over the input document: first to extract embedded metadata, and then to process the file using all embedded, found, and user-specified metadata.

I think that these changes, combined with consolidating metadata processing in the metadata & model documents, would make things much simpler. (And, if the Core Tabular Data could use the same rules, but just change the default for headerRowCount to 0, it could be simplified further still.)

iherman commented 9 years ago

Gregg,

I see that Jeni has already merged the separate branch on importing metadata:

http://w3c.github.io/csvw/metadata/index.html#importing-metadata

It is a relatively complex algorithm, there are some open issues, but we should check whether it answers your problems. I think it should....

Ivan


Ivan Herman Tel:+31 641044153 http://www.ivan-herman.net

(Written on mobile, sorry for brevity and misspellings...)


gkellogg commented 9 years ago

@JeniT's update is quite useful, but doesn't entirely address my concerns:

The Importing Metadata section describes the semantics of the import property, and does not directly speak to processing of metadata found from user-specified, embedded, and found metadata. You might imagine that user-specified metadata has an implicit import of embedded metadata, which has an implicit import of found metadata. But this would result in title, name, and predicateUrl properties which can't very well be overridden using these merge instructions.

The name property is atomic, so setting this through embedded metadata means that found metadata can't change it. title is a natural-language property, so the effect would be to merge in downstream values, which may be what is desired, but I'm not sure. predicateUrl is also an atomic property, and could likewise not be changed by downstream metadata.

This also leaves open how to establish dialect values when processing a CSV, but there's nothing contradictory. It does seem to me that in this case, merging found metadata into user-specified metadata would be useful before opening and reading the CSV (header-rows, language, etc.)

More comments on issue #105 (the merge definition).

iherman commented 9 years ago

Hey Gregg,

some comments below.

On 03 Jan 2015, at 22:19 , Gregg Kellogg notifications@github.com wrote:

In working on my implementation, I've faced some seeming contradictions in assertions made in the Model, Metadata, RDF, and JSON documents.

TL;DR: consider changing the metadata processing order to place embedded metadata at a lower precedence than other metadata.

In the Model document Locating Metadata section, the processing order of metadata gives the following precedence:

  1. User-supplied metadata,
  2. Embedded metadata,
  3. Metadata from a Link header,
  4. File-specific metadata, and
  5. Directory-specific metadata

Setting aside the format of any user-supplied metadata, I take this to mean the following:

  1. Open the target file (either CSV or metadata JSON)
  2. If opening the file results in a Link header with rel=describedby, process the referenced file as a metadata document
  3. Otherwise, if "file-metadata.json" exists, where "file" is the opened file without an extension, attempt to open that and process it as a Metadata document
  4. Otherwise, if "metadata.json" exists relative to the original file, attempt to open that and process it as a Metadata document.
  5. Otherwise, create new empty Table metadata.

Actually, the current text is not crystal clear on whether this is really 'otherwise'. I seem to remember that this is what we said at the F2F, but this may be discussed separately. I will comment separately on this for issue #105.

Attempt to extract metadata from the file, resulting in something like the following:

Given:

countryCode,latitude,longitude,name
AD,42.546245,1.601554,Andorra
AE,23.424076,53.847818,"United Arab Emirates"
AF,33.93911,67.709953,Afghanistan

Processing the file using steps 1-5 above, and presuming that no other metadata is found, results in defaults for Dialect information. This would produce metadata similar to the following:

{
  "@context": "http://www.w3.org/ns/csvw",
  "@id": "https://example.org/countries.csv",
  "schema": {
    "columns": [{
      "name": "countryCode",
      "title": "countryCode",
      "predicateUrl": "countryCode"
    }, {
      "name": "latitude",
      "title": "latitude",
      "predicateUrl": "latitude"
    }, {
      "name": "longitude",
      "title": "longitude",
      "predicateUrl": "longitude"
    }, {
      "name": "name",
      "title": "name",
      "predicateUrl": "name"
    }],
    "urlTemplate": "_:"
  }
}

I am not sure about the urlTemplate. The conversion document for RDF simply says that a new Blank Node should be defined for each row and we should not define the blank node ID in the metadata because that is against the very notion of blank nodes in RDF. (The RDF conversion generates abstract triples.)

This is then merged in with the (empty) metadata to create the metadata used to process the document. (Note that this being the case, the purpose of the Core Tabular Data processing model is not clear.)

If there were directory-specific metadata (such as in Example 22 in Foreign Key Reference Between Resources), the description of merging in csv2rdf seems to be at odds with my notion of what it means to merge documents together.

I do not see where that would be. The (intention of the) conversion documents do not speak about merging of metadata documents at all. What it describes is how specific metadata entries that are inherited flow down to, say, cell level; but the metadata in its globality is considered to be a given as far as I can see.

I think processing of metadata should probably be left to the metadata and model documents, so that it's not repeated needlessly. If we establish rules for extracting metadata from a CSV, along with clear precedence rules for Inherited Properties, there should be no need to repeat this in the csv2rdf and csv2json documents.

I think the repetition is related to the way inherited properties behave. And yes, I agree this is something that should be abstracted out somewhere. I believe this is the same as issue #112, isn't it?

In any event, the merge semantics described here seem to be at odds with the merge order described above. For example, merging the embedded metadata into the directory metadata would replace the "name" and "urlTemplate" values with the extracted values. I realize that the wording says not to overwrite those values (or to order "title" content), but that seems like a reverse merge to me.

Perhaps the order should place embedded metadata at a lower precedence, so that we start with embedded metadata, merge in metadata found in steps 2-4 above, and then merge in user-provided metadata. The primary consideration would be merging the "title" property, which would need to ensure that a merged-in "title" has a non-empty intersection with any existing value, and that the results of the merge are ordered with the original value first, followed by other values (although this is significantly complicated by the use of language maps, which are inherently unordered).

Well, actually... I wonder whether the "title" property should be part of that default structure in the first place. My initial instinct says that only the "name" property is defined by the first line; the absence of the "title" property does not fundamentally affect the JSON/RDF output.

To be discussed at a call, I guess; I am not sure I agree with the reversal of the priorities...

Ivan



Ivan Herman, W3C Digital Publishing Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID: http://orcid.org/0000-0003-0782-2704

gkellogg commented 9 years ago

I am not sure about the urlTemplate. The conversion document for RDF simply says that a new Blank Node should be defined for each row and we should not define the blank node ID in the metadata because that is against the very notion of blank nodes in RDF. (The RDF conversion generates abstract triples.)

You're right, this should just stay empty. My thought was that "_:" would expand to a new BNode, but that's not necessary.

I do not see where that would be. The (intention of the) conversion documents do not speak about merging of metadata documents at all. What it describes is how specific metadata entries that are inherited flow down to, say, cell level; but the metadata in its globality is considered to be a given as far as I can see.

I was referring to the model document's description of the order of relevance of metadata:

  1. metadata supplied by the user of the implementation that is processing the tabular data
  2. metadata embedded within the tabular data file itself
  3. metadata in a document linked to using a Link header associated with the tabular data file
  4. file-specific metadata in a document located based on the location of the tabular data file
  5. directory-specific metadata in a document located based on the location of the tabular data file

The implication to me is that metadata takes effect in that order. This could be implemented using the merge steps defined in @JeniT's import mechanism, applied in that order; if not that, there is no other way specified.

The problem is that you can't really run step 2 until you've considered steps 3-5, and it's not clear what the purpose of providing "name" (or "title" or "predicateUrl") would be in steps 3-5 if it's set in the embedded metadata. Allowing "name" to be specified in 3-5 allows for better naming when generating predicates.

Using the import merge rules, setting "title" from embedded metadata would seem to have the desired effect, and then defer defining "predicateUrl" and "name" (if necessary) until after all metadata has been merged. This would have the effect of allowing columns to be specified only from embedded metadata, unless explicitly defined in steps 1 and 3-5.
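To make this reading concrete, applying the import merge in precedence order could look like the following sketch, where `merge_import` is a hypothetical stand-in for the import rules, reduced here to "a lower-precedence source only fills in properties the higher-precedence result lacks":

```python
def merge_import(higher, lower):
    """Merge a lower-precedence source into a higher-precedence one;
    atomic properties already set are left untouched."""
    result = dict(higher)
    for key, value in lower.items():
        result.setdefault(key, value)
    return result

def effective_metadata(sources):
    """`sources` is ordered from highest precedence (user-supplied)
    down to lowest (directory-specific metadata)."""
    merged = {}
    for source in sources:
        merged = merge_import(merged, source)
    return merged
```

Under this reduction, a "name" set by embedded metadata does shadow any "name" from found metadata, which is exactly the concern described above.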

I think the repetition is related to the way inherited properties behave. And yes, I agree this is something that should be abstracted out somewhere. I believe this is the same as issue #112, isn't it?

Yes, I think #112 covers it.

iherman commented 9 years ago

On 05 Jan 2015, at 20:41 , Gregg Kellogg notifications@github.com wrote:

I am not sure about the urlTemplate. The conversion document for RDF simply says that a new Blank Node should be defined for each row and we should not define the blank node ID in the metadata because that is against the very notion of blank nodes in RDF. (The RDF conversion generates abstract triples.)

You're right, this should just stay empty. My thought was that "_:" would expand to a new BNode, but that's not necessary.

I do not see where that would be. The (intention of the) conversion documents do not speak about merging of metadata documents at all. What it describes is how specific metadata entries that are inherited flow down to, say, cell level; but the metadata in its globality is considered to be a given as far as I can see.

I was referring to the model document's description of the order of relevance of metadata:

  1. metadata supplied by the user of the implementation that is processing the tabular data
  2. metadata embedded within the tabular data file itself
  3. metadata in a document linked to using a Link header associated with the tabular data file
  4. file-specific metadata in a document located based on the location of the tabular data file
  5. directory-specific metadata in a document located based on the location of the tabular data file

The implication to me is that metadata takes effect in that order. This could be implemented using the merge steps defined in @JeniT's import mechanism, applied in that order; if not that, there is no other way specified. The problem is that you can't really run step 2 until you've considered steps 3-5, and it's not clear what the purpose of providing "name" (or "title" or "predicateUrl") would be in steps 3-5 if it's set in the embedded metadata. Allowing "name" to be specified in 3-5 allows for better naming when generating predicates. Using the import merge rules, setting "title" from embedded metadata would seem to have the desired effect, and then defer defining "predicateUrl" and "name" (if necessary) until after all metadata has been merged. This would have the effect of allowing columns to be specified only from embedded metadata, unless explicitly defined in steps 1 and 3-5.

Ok, I think that is now moved to the dedicated issue #145; clearly these are intertwined.

Ivan


JeniT commented 9 years ago

I've made a new commit (c03166592d5e419936ca403a6d9e00bfe9838485) that includes some worked-through examples of how an annotated tabular data model is created in different circumstances. It doesn't spell out everything, but I'm hopeful this is a start.

iherman commented 9 years ago

One pending issue (more exactly, what you think about it :-) is covered by these examples. The dialect information in the metadata seems to be descriptive rather than prescriptive, meaning that dialect-specific settings, like the usage of tab instead of comma, are not taken from the metadata file (wherever that is) but, rather, from a tool-specific setting that is not standardized by us. Is this indeed your thought? (It certainly is mine...) This takes care of one of Gregg's issues, namely that one has to get all the metadata and merge it before parsing...

That being said: if the dialect is descriptive, then I wonder what the usage of it is. It is not used right now (as far as I remember) in the RDF/JSON output... But, of course, other tools can decide to use that.

Ivan


gkellogg commented 9 years ago

The specific dialect things I was concerned about are skipRows, skipColumns and headerColumns. I also think that things like separator, quoteChar, and lineTerminator might come from metadata. Of course, there may be forms of tabular data which do require tool-specific rules (such as HTML table as a source of tabular data).

If the rules for considering both user-supplied and embedded metadata are different from my interpretation, then this can be taken from found metadata (steps 3-5) and used to interpret the input file (CSV). We might then have a means of reconciling embedded annotations from the input file after having determined the necessary dialect information; in my mind, this probably just comes down to notes and titles.

The difference is when there is no user-supplied or found metadata: metadata needs to be created from embedded information only, in which case name and propertyUrl are determined from title.

gkellogg commented 9 years ago

To clarify:

My interpretation is that name is an optional property. When accessing the value of name, if it is not set explicitly, it is taken from the first value of title (in the appropriate language) if it exists, and _col=N otherwise. predicateUrl then defaults to "#" + URI.encode(name). (Interesting aside: :_col=1 is not a valid PNAME, and can't be encoded as such in Turtle; we may think of an alternative representation, if that matters.)
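A sketch of those defaults, with the caveat that this is my interpretation, and that `URI.encode` is approximated here with `urllib.parse.quote` (an assumption; the specs do not pin down the escaping). `title` is treated as a plain list for simplicity:

```python
from urllib.parse import quote

def effective_name(column, index):
    """`column` is a column description dict; `index` is 1-based."""
    if "name" in column:
        return column["name"]
    titles = column.get("title") or []
    if titles:
        return titles[0]  # first title in the appropriate language
    return "_col=%d" % index

def default_predicate_url(column, index):
    # Percent-encoding stands in for URI.encode; note it also escapes
    # the "=" in "_col=N".
    return "#" + quote(effective_name(column, index))
```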

iherman commented 9 years ago

On 22 Jan 2015, at 21:40 , Gregg Kellogg notifications@github.com wrote:

To clarify:

My interpretation is that name is an optional property. When accessing the value of name, if it is not set explicitly, it is taken from the first value of title (in the appropriate language) if it exists, and _col=N otherwise. predicateUrl then defaults to "#" + URI.encode(name). (Interesting aside: :_col=1 is not a valid PNAME, and can't be encoded as such in Turtle; we may think of an alternative representation, if that matters.)

We should look at the default metadata overall as a separate issue. Eg., there may be an approach to take the R2RML Direct Mapping as a starting point for the default metadata (wherever appropriate).

Ivan


gkellogg commented 9 years ago

Resolved in PR #169 and on http://www.w3.org/2015/01/28-csvw-minutes.html.