w3c / csvw

Documents produced by the CSV on the Web Working Group

Clarification of rationale for template expansion semantics #888

Open RickMoynihan opened 2 years ago

RickMoynihan commented 2 years ago

Hi,

I'd appreciate some clarification on a confusing area of the CSVW specification. The specification itself is very clear on the matter; what is unclear to me is the rationale for the decision and the expectations around it, and it is mainly the rationale that I'd like clarified.

I've been digging through the CSVW specs and working group discussions (in GitHub issues, IRC logs and minutes) trying to uncover the rationale for this myself, and I can't find any mention of it, so I thought I'd raise it here to understand CSVW's design more clearly.

The area of clarification concerns the choice of base URL used for resolution of URI templates. In particular, the base is the table's URL and not the metadata document's URL.

The relevant section of the spec is here and is specifically this sentence describing the process for generating the annotation value:

3: resolving the resulting URL against the base URL of the table url if not null.

I would have expected the base for resolving these templates, after prefix expansion, to be either the @base declared in the metadata document or the metadata document's location. The decision here is highly confusing and counterintuitive, given that nothing else assumes the table url:

  1. all other properties are resolved in terms of the metadata documents base.
  2. prefixes are expanded as if they followed the rules of the CSVW @context under a JSON-LD interpretation (I'm being overly careful with my words here because it's not strictly JSON-LD)
  3. language tags are resolved in terms of the context
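The gap between the two candidate bases can be made concrete with a minimal Python sketch. The URLs and the `person/{id}` template are hypothetical, chosen only for illustration; the sketch assumes the template has already been expanded and shows only the final resolution step.

```python
from urllib.parse import urljoin

# Hypothetical locations, chosen only for illustration.
METADATA_URL = "http://example.org/metadata/people.csv-metadata.json"
TABLE_URL = "http://example.org/data/people.csv"

# Assumed result of expanding a relative template such as "person/{id}" for id=42.
expanded = "person/42"

# Behaviour described by the quoted spec sentence: resolve against the table url.
against_table = urljoin(TABLE_URL, expanded)

# Behaviour the issue author expected: resolve against the metadata document.
against_metadata = urljoin(METADATA_URL, expanded)

print(against_table)
print(against_metadata)
```

When the CSV file and its metadata live in different directories, as assumed here, the two choices yield different identifiers for the same row.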

I've trawled the issues, minutes and commits for clues as to why this was chosen as the specified behaviour, and I can't see any explicit mention of it. There is, though, a lot of discussion in which the working group largely seems opposed to these semantics, and no openly discussed justification for them that I can find, e.g.:

@gkellogg: I believe we said elsewhere that information defined in a metadata file would be localized to that file. Certainly, a title needs to take the @language from the file in which it is located, doing the same with base URL with, or without @base being explicitly defined, is consistent with that. The main thing is that we be consistent, since anything else will only lead to more problems. link

@iherman: In my view @base should follow the priority that we have established for metadata. This means that @base, say, defined in the user metadata should override the metadata files. That makes things way simpler and it is simply a matter of properly defining things. link

Yes, I think we should be sure that template values are expanded first. I think it needs to be done when transforming into a Template used by the template processor, and before joining to the metadata base URL. I'll take the action of adding such text. link

@JeniT: I've clarified that the base URL for the metadata document in the absence of the @base property in the @context is the location of the metadata document itself. It's good practice for URLs to be taken as relative to the location where they're found. For example, the URLs of any imported metadata documents should be relative to the original metadata document or it will be incredibly confusing. It might be that there are particular properties that should be interpreted as URLs relative to the @id of a table description, but I think we need to define them explicitly as doing so. 6a6d74, which properties do you think they are? link

@JeniT: I think in many cases the propertyUrl will want to be resolved relative to the location of the document that it's located in (which might be a separate schema file that is shared across multiple files). I suggest that to handle the requirement for URI templates to sometimes be URLs that are relative to the processed CSV file, we have another special variable, eg _tableUrl, which is the URL of the table (from the url property as it is now). Then you can have things like {_tableUrl}#row={_row} if you want to generate URLs like that to identify rows. link
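A rough sketch of how the proposed `{_tableUrl}#row={_row}` template might behave is shown below. It uses plain placeholder substitution, not a conforming RFC 6570 expansion (which would also percent-encode values), and `_tableUrl` is the variable proposed in the comment above, not necessarily one a processor supports. The helper name `fill_placeholders` and the table URL are hypothetical.

```python
import re

def fill_placeholders(template: str, variables: dict) -> str:
    """Replace each {name} with its value; plain substitution, no percent-encoding."""
    return re.sub(r"\{(\w+)\}", lambda m: str(variables[m.group(1)]), template)

# Hypothetical table location.
TABLE_URL = "http://example.org/data/people.csv"

# Build one row identifier per row number, as the proposal describes.
row_ids = [
    fill_placeholders("{_tableUrl}#row={_row}", {"_tableUrl": TABLE_URL, "_row": n})
    for n in (1, 2)
]
print(row_ids)
```

With an explicit variable like this, a template is relative to the CSV file only when its author asks for that, which appears to be the point of the suggestion.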

Given so many views supporting the position that the metadata document should be the base for URI resolution, what changed? Is the spec actually correct in this regard? I suspect it is, given @JeniT's comments here, the presence of supporting notes in the spec and some affirmative changes in the git history.

I've spent a long time investigating this issue and wondering whether it was a bug in our implementation of the csv2rdf spec, an error in the specification itself, or a misunderstanding on my part. I feel I must be missing something important, a rationale to justify the semantics. Any clarification you can provide would be appreciated. My best guess is that it's because a representation-centric design of annotations was chosen rather than a model-centric one -- however that's probably best left for another discussion!

I appreciate entirely all the work the working group has put into these specs, and am trying my hardest to make the most of it. So any advice on a workaround that is clearly within the spec would also be appreciated (hardcoding the full URIs everywhere is, as far as I'm concerned, a last resort). I was, for example, wondering whether I could work around this by resolving these URIs relative to the @id on the table (which I think is one option Jeni suggests above); however, some parts of the spec appear ambiguous about whether that's how it should work.

gkellogg commented 2 years ago

That's quite some sleuthing! Generally, within JSON-LD, the @base is used for resolving document-relative IRIs within a document having an @context defining that @base (which must be in the document itself, not retrieved via a remote context). The metadata document describes the CSV and, at one point, could be used for different CSV files, so it's consistent with JSON-LD definitions that the @base declaration pertains to the metadata file (a JSON-LD file), and not the document it references.
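As a minimal sketch of that rule, the base for link properties can be computed from the metadata document alone. The function name `metadata_base_url` and all URLs are hypothetical; the sketch assumes @base is only honoured when it appears inline in the document's own @context, and that a relative @base is resolved against the document's location.

```python
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"

def metadata_base_url(metadata: dict, location: str) -> str:
    """Return the inline @base (resolved against location) or, failing that, location."""
    context = metadata.get("@context")
    if isinstance(context, list):
        for entry in context:
            # Only an inline context object can carry @base.
            if isinstance(entry, dict) and "@base" in entry:
                return urljoin(location, entry["@base"])
    return location

# Hypothetical metadata location and documents.
LOCATION = "http://example.org/metadata/people.csv-metadata.json"
with_base = {"@context": [CSVW_NS, {"@base": "http://example.org/data/"}], "url": "people.csv"}
without_base = {"@context": CSVW_NS, "url": "people.csv"}

# The link property "url" is resolved against the metadata base in both cases.
url_with_base = urljoin(metadata_base_url(with_base, LOCATION), with_base["url"])
url_without_base = urljoin(metadata_base_url(without_base, LOCATION), without_base["url"])
print(url_with_base)
print(url_without_base)
```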

The rationale I take from the discussion is that URI Templates provided an adequate way of expanding relative IRIs (URLs) in CSV documents. It would have been discussed in some detail at the February 2014 F2F in London (and day 2). I think you noted issue #191 which seems to go into this reasoning the most.

RickMoynihan commented 2 years ago

Thanks again for the prompt reply @gkellogg, I really appreciate you trying to help here.

That's quite some sleuthing!

Thanks; but this is only possible due to the working group's meticulous recording of the decision-making process. So thanks again for that; it's very valuable having open access to the majority of the discussions! Unfortunately the one thing that appears to have been omitted is the reasoning for resolving against the table url.

Firstly regarding this:

Generally, within JSON-LD, the @base is used for resolving document-relative IRIs within a document having an @context defining that @base (which must be in the document itself, not retrieved via a remote context). The metadata document describes the CSV and, at one point, could be used for different CSV files, so it's consistent with JSON-LD definitions that the @base declaration pertains to the metadata file (a JSON-LD file), and not the document it references.

Yes, I'm in complete agreement with all of that! However, this says nothing about the inconsistency of using the table URL to resolve URI templates.

It would have been discussed in some detail at the February 2014 F2F in London (and day 2).

Thanks for linking these; I had seen some of the chat logs, but appear to have missed some of these.

I think the relevant part of the logs is here, from the first day (pasting as I can't link to it directly):

<danbri> jenit: i think we agree that the link properties are resolved against the base url, maybe the @base from the context, or it may be the location of the metadata file, during normalization of the metadata file, and prior to merge.
[[[If the property is a link property the value is turned into an absolute URL using the base URL.]]]
<danbri> jenit: 2nd piece of this, is what happens to these url templates
<danbri> these can't get expanded until you are actually processing data
<danbri> at which point you have your merged metadata as basis of what you are doing
<danbri> if you have lost your base url, or not got, what to resolve against becomes tricky
<danbri> also - jtandy's 1st assumption, that those would be resolved against url of the csv file
<danbri> so when you had template like #rownum=5
<danbri> then that would be ref to something within the csv file
<danbri> not relative to any of the metadata files it might be in
<danbri> which raises the usability perspective, ...
<danbri> … it might be better for the url templates to be ref'd against the csv file
<danbri> to have that as the default
<danbri> gkellogg: i won't stand in way, but am not enthusiastic
<danbri> … you can always avoid trouble by having absolute urls
<danbri> jtandy: we just need to be clear on what happens when not an absolute url
<danbri> ivan: raising q: is it not confusing for authors, that we have 2 diff ways of absolutising urls
<danbri> depending on whether they are link properties or templates
<danbri> … a completely diff approach would be that we don't do this under normalization
<danbri> instead use the table url just like for templates
<danbri> jenit: how do you resolve the table url? that's the link property
<danbri> gkellogg: json-ld has an url expansion algo
<danbri> … nominally each json-ld doc has a location which can overide @base
<danbri> ...
<danbri> if we say it is undefined, this would be the only doc (format) i've dealt with in which you start off with a base and then lose it along the way
<danbri> ivan: talking about confusing, … that means I get a merged metadata, and the various templates in that metadata will expand differently
<danbri> … the templates will expand depending on where they come from
<danbri> gkellogg: no, there's a single base url notionally
<danbri> ivan: then i don't understand the issue
<danbri> gkellogg:I think we said it's the csv file it is expanded against
<danbri> that's what i reacted to , saying that this is weird, …
<danbri> jenit: [missed]
<danbri> discussion of detail of mess starting with the csv file vs metadata
<danbri> jtandy: key issue to my mind, uri templates only get expanded once you've done all the merging, ...
<danbri> … only at that point,
<danbri> gkellogg: only at row processing stage
<danbri> jtandy: … templates get expanded, … urls get resolved, …
<danbri> gkellogg: which we're saying is the expanded url property of the table
<danbri> jtandy: at least we always know what that is
<danbri> jtandy: to clarify, this is for the metadata doc, and by time we get to conversions, this will all have been expanded?
<danbri> [yes]
<danbri> jenit: do we in abstract table data model need url in each cell not just value
<danbri> i.e. what you'd get from value url
<danbri> gkellogg: that is the value of the cell
<danbri> jenit: no
<danbri> -> example in piratepad
<scribe> scribenick: gkellogg
iherman: just to clarify, linkproperty values can be CURIEs/PNames
<JeniT> https://github.com/w3c/csvw/issues/121
<danbri> gkellogg: discussion of expanding urls, we talked about json-ld, then asked about URL spec
<danbri> reason for that is that url spec doesn't deal with prefixes
<scribe> scribenick: danbri
ivan: spec-wise it is fine, but if i read that doc it is like some of the HTML5 specs
jenit: does it specify the behaviour that we want it to specify
… there is no other good url spec to reference
jenit: i think it is at least consistent to point to the json-ld one
ivan: that's why i asked what i asked. back then it went into a whole set of things that were v json-ld specific, with prefixes etc.
…that was my fear
… it goes into all kinds of detail on context processing
gkellogg: we are using a context, we have one defined that defines all of our terms, that is the one used when expanding these values
jenit: let's defer this, maybe discuss over lunch, ...
gkellogg: if we choose something else let's say it is intended to be consistent with json-ld iri expansion
ivan: one thing it does introduce, … and we do not, is issue of syntax for bnode identifiers
gkellogg: but we can constrain the value space...
jenit: suggest resolve as "we'll summarize the algo from json-ld spec, extract bits that are relevant, and say it is intended to be consistent with the spec
gkellogg: yes, can do that
… re bnodes i think it is intent of group to avoid using a bnode syntax where URIs can be used
ivan: maybe we need some sort of appendix
saying this is json-ld compatible, but with these-and-these restrictions
e.g. that we restricted what can go into a context
… that we have restricted yesterday the evaluation of common properties, etc.
… i.e. there are a number of places where we restrict json-ld
[general agreement]
resolved: We will summarise the expansion processing that is necessary for our purposes, and say that it is intended to be consistent with JSON-LD IRI expansion. We do have some restrictions on what IRIs can be used, eg we don't allow blank node syntax.
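The two-stage processing discussed in the log above can be sketched roughly as follows, assuming hypothetical URLs and a table description whose `aboutUrl` is the `#rownum=...` style template mentioned in the discussion. Template expansion is reduced to a single string replacement of `{_row}`; this is not a full URI template processor.

```python
from urllib.parse import urljoin

# Hypothetical metadata location and table description.
METADATA_URL = "http://example.org/metadata/people.csv-metadata.json"
table = {"url": "../data/people.csv", "aboutUrl": "#rownum={_row}"}

# Stage 1 (normalization): the link property "url" is made absolute
# against the metadata base, before any data is processed.
table_url = urljoin(METADATA_URL, table["url"])

# Stage 2 (row processing): the template is expanded per row and the
# result is resolved against the already-absolute table url.
def about_url_for_row(row_number: int) -> str:
    expanded = table["aboutUrl"].replace("{_row}", str(row_number))
    return urljoin(table_url, expanded)

# For comparison only: the same fragment resolved against the metadata document.
alternative = urljoin(METADATA_URL, "#rownum=5")

print(about_url_for_row(5))
print(alternative)
```

Under these assumptions the fragment identifies something within the CSV file, which matches the usability point raised in the log; resolved against the metadata document, it would instead point into the metadata file.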

Unfortunately there is still no rationale here, just the change to resolve in terms of the table url. Dan Brickley appears to briefly entertain the idea of resolving the templates in terms of the table url, but offers justifications only for why it is, in his words, "weird"; the only argument in favour appears to be a vague and unspecified appeal to usability (which personally I find hard to reconcile):

which raises the usability perspective, ... … it might be better for the url templates to be ref'd against the csv file to have that as the default

I think you noted issue https://github.com/w3c/csvw/issues/191 which seems to go into this reasoning the most.

Regarding issue #191, I think it too argues only for the @base for resolution being the metadata document, never the table url.