w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Example 60, 61, 62, why is accessURL included #1437

Closed smrgeoinfo closed 2 years ago

smrgeoinfo commented 2 years ago

In Examples 60, 61, and 62, both accessURL and downloadURL are included, and they are the same URLs. These are all download URLs, so I don't think accessURL should be included in the example RDF.

example from 61:

  dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
  dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;

simsong commented 2 years ago

It seems from the demo that they both download a tar file. It would be really nice if the standard distinguished a URL from which a dataset is downloaded in its entirety from one where a dataset is accessed through an API, but I realize that this may not make sense.

jakubklimek commented 2 years ago

@smrgeoinfo This example is from a DCAT-AP compliant catalog, where dcat:accessURL is mandatory, and should be a duplicate of dcat:downloadURL whenever dcat:downloadURL is available. This is just an explanation of the example. It does not mean that it needs to stay this way.

smrgeoinfo commented 2 years ago

@jakubklimek -- interesting, but isn't the DCAT-AP provision inconsistent with the DCAT definitions of accessURL and downloadURL?

jakubklimek commented 2 years ago

@smrgeoinfo I do not see a conflict here. DCAT says that dcat:downloadURL should be used when a direct download is available, but it does not prohibit duplication in dcat:accessURL

smrgeoinfo commented 2 years ago

dcat:accessURL: A URL of the resource that gives access to a distribution of the dataset. E.g. landing page, feed, SPARQL endpoint. I don't think a link to a .tar file meets this definition.

jakubklimek commented 2 years ago

@smrgeoinfo If this would be a consensual interpretation, then it would be a much bigger issue for DCAT-AP, not only an issue in this example, as I explained above.

kcoyle commented 2 years ago

Hmmm. In DCAT-AP dcat:accessURL is mandatory, and defined as:

  This property contains a URL that gives access to a Distribution of the
  Dataset. The resource at the access URL may contain information
  about how to get the Dataset. 

dcat:downloadURL is optional, and is defined as:

  This property contains a URL that is a direct link to a downloadable file in a given format.

Given that there are two properties, it makes sense that they would have different semantics. It appears that DCAT-AP has blurred that. The general "rule" (if there is one) for APs is that they may not expand the definition of the properties that they are reusing, but may narrow definitions if the original property has minimal semantic commitment. The usage in the AP must be compatible with the defined semantics in the original vocabulary. Had DCAT-AP used the original definition of dcat:accessURL that would have made the difference between them clear. However, it would seem that the constraint on these two properties would have been that one or the other is mandatory, whichever fits the specific case, and if both are used they would have different IRIs as their objects. This does not solve @jakubklimek's problem with existing data that uses the two properties. The dilemma is: change and invalidate existing data, or don't change and continue to create data with this ambiguity.

That said, in the work we're doing on tabular application profiles in Dublin Core we still have not come up with a good way to say "A or B" as a property constraint. It's a common need and I would very much like to be able to provide that option without greatly complicating what is now a simple table format. I began (and need to finish!) a transformation of DCAT-AP to DC TAP and it works well, but it would not accommodate the either/or that this requires.

kcoyle commented 2 years ago

Thinking about this some more, I am more of the mind that the problem is in having assigned "mandatory" to one of them. I'm leaning toward having something like:

Mandatory: oneOf[dcat:accessURL|dcat:downloadURL] - cardinality 1,0

Optional: (or recommended?) dcat:accessURL cardinality 1,0 dcat:downloadURL cardinality 1,0

This doesn't quite solve the problem (because it depends on if the cardinality of the "oneOf" interacts with the cardinality of the optional statements) but the details could be resolved in a SHACL or ShEx expression of the rules.
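As a rough illustration of the "one of" rule discussed above, here is a minimal Python sketch. It assumes a distribution is held as a plain dict mapping property names to lists of values; that representation and the helper name `satisfies_one_of` are hypothetical and not defined by DCAT or DCAT-AP. It only checks that at least one of the two properties has a value and does not try to settle the cardinality interaction mentioned above.

```python
ACCESS = "dcat:accessURL"
DOWNLOAD = "dcat:downloadURL"


def satisfies_one_of(distribution: dict) -> bool:
    """True if the distribution carries at least one of the two URL properties."""
    # A missing key and an empty list are treated the same: no value supplied.
    return bool(distribution.get(ACCESS)) or bool(distribution.get(DOWNLOAD))


# Illustrative records (hypothetical data, reusing the URL from Example 61).
download_only = {DOWNLOAD: ["https://mvcr1.opendata.cz/czechpoint/data.tar"]}
no_urls = {"dcterms:title": ["Untitled"]}

print(satisfies_one_of(download_only))  # True
print(satisfies_one_of(no_urls))        # False
```

A SHACL or ShEx expression would state the same condition declaratively; the sketch just shows how small the check is on the consuming side.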

aisaac commented 2 years ago

@kcoyle is your proposed solution for DCAT-AP or DCAT? If it is for DCAT-AP then it could be worth flagging to them!

Personally I think in any case we shouldn't change what DCAT says, and remove the dcat:accessURL from the examples. I remember it was quite difficult for me to understand the difference between dcat:accessURL and dcat:downloadURL, and the sentence that @smrgeoinfo mentions (at https://w3c.github.io/dxwg/dcat/#Property:distribution_access_url) was a key hint for me. Also the following one "dcat:downloadURL is preferred for direct links to downloadable resources" that kind of expresses that dcat:accessURL and dcat:downloadURL are a bit of an alternative. I agree this is not a formal equivalent, but if there are examples that apparently contradict this usage note, then the picture becomes blurry again!

kcoyle commented 2 years ago

@aisaac AFAIK DCAT itself doesn't include cardinality, so I believe this only relates to DCAT-AP. My reading of dcat:accessURL vs dcat:downloadURL is that the former points to a web site and the latter to an actual file to be downloaded. That seems clear enough. But it becomes muddied a bit in DCAT-AP because dcat:accessURL is mandatory, yet it seems that one doesn't always have a website URL to put there. Thus, people have filled that in with the same link as the dcat:downloadURL simply because they have to fill it in. I think the question by @smrgeoinfo that opened this issue might require the DCAT-AP folks to rethink the cardinality constraints around these two properties.

agreiner commented 2 years ago

As a developer, I much prefer the sort of requirement that DCAT-AP is using, because it's a pain in the neck to deal with extra conditionals for things like this in code. I would much rather be able to grab a bunch of variables from a database and send them to the UI without parsing them each in some special way. If you're building a web app that needs to return metadata for online datasets, you are certain to need to return something as a URL. Since a direct download URL isn't necessarily available for everything, the access URL is the logical choice to require.

Also, I think the usage note is a little confusing. It says, "If the distribution(s) are accessible only through a landing page (i.e. direct download URLs are not known), then the landing page URL associated with the dcat:Dataset SHOULD be duplicated as access URL on a distribution." That is true, and if the dataset is marked up with a landing page URL, I suppose it is a duplication to use the same one for the access URL of the distribution, but the real question here is about how the two URLs for the distribution relate to each other. I think it would be more helpful to say "If the distribution(s) are accessible only through a direct download, then the download URL associated with the distribution SHOULD be duplicated as the access URL."

smrgeoinfo commented 2 years ago

I agree with @kcoyle; accessURL should not be mandatory. A distribution might be via a landing page or there might be a direct download URL; either one might not exist. For UIs I'm interested in, I'd like to let the user know if the URL is going to get data or if it's going to get another web page. More importantly, for machine actionable metadata, the distinction is critical so a client app will be able to know it can get data to work with. Seems like the condition should be that one of accessURL or downloadURL is mandatory on a distribution.

agreiner commented 2 years ago

My understanding of accessURL has been that it is not specifically a landing page. It is the URL to which one would direct a user wanting to access a distribution. It may be a downloadable file, and it may be a landing page, and it may be an API endpoint. The latter is explicitly mentioned in the definition, so I don't think it was ever intended to be limited to landing pages. The fact that a downloadURL isn't listed among the examples doesn't inherently exclude it, though I can see that the wording of the note is ambiguous, as @aisaac pointed out. Rather than change the meaning for what is in a sense the most fundamental property of a distribution on the web, I'd suggest we just clarify. As for the functionality that @smrgeoinfo wisely asks for, you can determine that an accessURL is a download page if it matches downloadURL. You can determine that it is an API if it matches the endpointURL of the given accessService. An app that requires machine actionable data would not be able to do anything with an accessURL that isn't for a download or an API endpoint anyway, so wouldn't it just look for downloadURLs and endpointURLs to begin with? Moreover, making accessURL something that may or may not contain data doesn't make it any easier to determine the type of the resource. It becomes one of at least three properties that need to be analyzed before use, might disagree with each other, and might all be left blank for the wrong reason. Keeping the broader definition and requiring accessURL also gets around the problem of having no good facility for requiring "one of" that @kcoyle noted.
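The matching logic described in this comment can be sketched in a few lines of Python. This assumes a distribution is a plain dict with optional `accessURL`, `downloadURL`, and `accessService` keys, the last being a list of dicts that each carry an `endpointURL`; the dict layout and the function name `classify_access_url` are illustrative assumptions, not part of DCAT.

```python
def classify_access_url(distribution: dict) -> str:
    """Label accessURL as 'download', 'api', 'other', or 'missing'."""
    access = distribution.get("accessURL")
    if access is None:
        return "missing"
    # Same value as downloadURL: the access URL is a direct download.
    if access == distribution.get("downloadURL"):
        return "download"
    # accessService is modelled here as a list of dicts with an endpointURL.
    endpoints = {s.get("endpointURL") for s in distribution.get("accessService", [])}
    if access in endpoints:
        return "api"
    # No match: it may be a landing page, a feed, or something else.
    return "other"


tar = "https://mvcr1.opendata.cz/czechpoint/data.tar"
tar_dist = {"accessURL": tar, "downloadURL": tar}
api_dist = {
    "accessURL": "https://example.org/sparql",  # hypothetical endpoint
    "accessService": [{"endpointURL": "https://example.org/sparql"}],
}

print(classify_access_url(tar_dist))  # download
print(classify_access_url(api_dist))  # api
```

Note that the fall-through case cannot say what the URL actually is, only that it matched neither of the other two properties.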

aisaac commented 2 years ago

Nice discussion! And I think that @agreiner 's point that a data client should try to always fetch downloadURL or accessURL is right.

I would just recommend warning against wording like this:

you can determine that an accessURL is a download page if it matches downloadURL.

To me it doesn't flag clearly enough that in principle such a match should never occur. Since downloadURL is for downloadable files and accessURL for pages, there's something wrong if a page is indicated in downloadURL, in the same way that it does not feel right if a file to download is indicated via accessURL. It's a pity that the cardinality of the properties in DCAT-AP encourages this confusing duplication between accessURL and downloadURL, as @kcoyle points out.

(and at this stage I hope I'm still right, as I got the feeling that your points clarify my question above ;-) )

agreiner commented 2 years ago

I think maybe you misunderstand me. I don't agree that in principle such matches should never occur. If accessURL is taken as a single field to be used whenever one needs a reliable link for a dataset, it can potentially be a friendly landing page, a download location, or an API endpoint. In that case, there is nothing wrong if a file to download is indicated there. I'm not aiming at any kind of symmetry here. There are good reasons for having a single field for this, and I think that's exactly why DCAT-AP requires it.

aisaac commented 2 years ago

Sorry @agreiner I misunderstood you, indeed. And for clarification then I'm on the side of trying to separate these two properties as much as possible, following the approach of "dcat:downloadURL is preferred for direct links to downloadable resources." (in the notes for dcat:accessURL). I agree that in principle having a single property would be interesting for data consumption, but then for machine clients it would require a complete declarative apparatus to further describe the objects of that property (say, to indicate that a given URL is an HTML landing page for humans or a CSV file with real data in it). But DCAT doesn't have this, nor does it encourage using one, as far as I can see in the documentation for accessURL.

agreiner commented 2 years ago

I don't see that use case of determining whether the accessURL is a landing page or a download or something else as a realistic one. If I were writing an app that needed one or the other, I would read just the one relevant field that I'm looking for, e.g., downloadURL. I'm thinking it would behoove us to see what others outside this group think. We see DCAT-AP using it in the way I had expected. Also, the ODI do the same. Are there other users doing the opposite? I still think we should be clarifying rather than altering vocabulary that others are relying on.

smrgeoinfo commented 2 years ago

@agreiner "If I were writing an app that needed one or the other, I would read just the one relevant field that I'm looking for, e.g., downloadURL." In the ODI example you link to, there is only a downloadURL, but the cardinality in DCAT requires an accessURL; the markup in the document only has an accessURL. A machine client parsing this would have to guess that that URL is likely to be a download URL, which is possible of course since the mediaType is text/csv, but it is more work for the client. See comment above.
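The extra work for the client described here might look roughly like the following Python sketch. It assumes a distribution is a plain dict with optional `downloadURL`, `accessURL`, and `mediaType` keys, with the media type given as a simple string such as `text/csv`; the layout, the function name `pick_data_url`, and the "not HTML means probably a file" heuristic are all assumptions made for illustration.

```python
def pick_data_url(distribution: dict):
    """Return (url, how), where how is 'declared', 'guessed', or None."""
    download = distribution.get("downloadURL")
    if download:
        # The publisher said explicitly that this URL yields a file.
        return download, "declared"
    access = distribution.get("accessURL")
    media_type = distribution.get("mediaType", "")
    # Heuristic only: a non-HTML media type hints that accessURL is a file.
    if access and media_type and media_type != "text/html":
        return access, "guessed"
    return None, None


# Hypothetical record shaped like the case above: accessURL only, CSV media type.
csv_dist = {"accessURL": "https://example.org/data.csv", "mediaType": "text/csv"}

print(pick_data_url(csv_dist))  # ('https://example.org/data.csv', 'guessed')
```

The second element of the result makes the difference visible: a client can act on a declared URL directly, while a guessed one still has to be verified by fetching it.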

kcoyle commented 2 years ago

It does make me wonder why there are two different properties in DCAT, unless (and I didn't read it that way) dcat:downloadURL is a further specification of dcat:accessURL. If the former is a sub-property of the latter then @agreiner 's analysis makes sense, and dcat:accessURL could have as its value a URL that is also appropriate in dcat:downloadURL. However, the usage note reads:

dcat:accessURL SHOULD be used for the URL of a service or location that can provide access to this distribution, typically through a Web form, query or API call.

dcat:downloadURL is preferred for direct links to downloadable resources.

Which tells me that the DCAT vocabulary does see them as having different uses, albeit softened with "should" and "preferred."

The issue of making only dcat:accessURL mandatory is only a problem because of how the documentation separates all properties into categories like "mandatory", "optional", and "recommended". There is no technical reason why you could not have a rule stating that either dcat:accessURL or dcat:downloadURL must be present, just not in the simple table form that the documentation uses.

In the end, my main concern is that an application profile using terms from a vocabulary must be true to the semantics of the terms being used. The fact that DCAT 1) does not define these as disjoint and 2) uses non-binding language, probably means that the DCAT-AP is technically correct. I appreciate the practicality of Annette's use case, but also @smrgeoinfo 's rebuttal.

agreiner commented 2 years ago

See the usage note under DownloadURL in DCAT 1. https://www.w3.org/TR/2014/REC-vocab-dcat-20140116/

kcoyle commented 2 years ago

@agreiner I see the difference between the DCAT 1 usage note and the current one, but DCAT 1 seems to me to make an even stronger distinction between access and download, saying that download is "A file that contains the distribution of the dataset in a given format." Whereas accessURL is "A landing page, feed, SPARQL endpoint or other type of resource that gives access to the distribution of the dataset." The key terms here, IMO, are "contains" vs "gives access to". Again, if there weren't a difference in meaning between these terms it wouldn't make much sense to have them both. Maybe DCAT-AP needs a super-property that can be either? And applications would then know that it could be either?

agreiner commented 2 years ago

@kcoyle, I see the distinction you and others are making here, but I read it quite differently. I think we agree on what downloadURL is asking for. I think the operative words are under the definition of accessURL, where it says "A landing page, feed, SPARQL endpoint or other type of resource". It's very clear that they never intended to limit it to landing pages, and that they did intend it to apply to multiple types of resources. I read "access" as to obtain in some way, be it through a direct download, an API, or through a link on a landing page. @smrgeoinfo is right in noting the error in the ODI example of failing to include a downloadURL for a download, but that would be an error in any of our views. If we required only one of the two, users could still just as easily fail to include a downloadURL when they had one. What I wanted to point out there is that they interpret accessURL as legitimately describing a download location. That shouldn't be surprising, given that several other types of resource are explicitly mentioned in the definition, along with "other type(s) of resource". I don't see that as ambiguous at all.

kcoyle commented 2 years ago

@agreiner What you describe is something I mentioned before, which is that downloadURL seems to be a more specific property whose value could be included in accessURL, which I would define as a property/subproperty relationship. I don't know if it works this way, but perhaps in that case downloadURL would be considered an accessURL for the purposes of searching. I'm thinking of an analogy to the way that SPARQL manages retrieval using class relationships. In any case, I think that making a specific relationship between them would clarify their semantics. It also would provide an opening for other subproperties if those are desired. This then brings up the issue that one of the options in accessURL is "service" yet service is defined elsewhere in DCAT with its own endpointURL.

So I won't comment further, because it is just getting more messy in my head. I suspect that the answer is to leave DCAT-AP as is, since people seem to have found a comfortable way to make use of it.
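To make the property/subproperty idea concrete, here is a small Python sketch of how a search for accessURL values could also return downloadURL values. The `SUB_PROPERTY_OF` mapping is an assumption made purely for illustration and is not taken from the DCAT specification; the triples, the `ex:` identifiers, and the helper name `values_of` are likewise hypothetical.

```python
# Hypothetical declaration for illustration; not taken from the DCAT spec.
SUB_PROPERTY_OF = {"dcat:downloadURL": "dcat:accessURL"}

# Toy data as (subject, property, object) triples.
triples = [
    ("ex:dist1", "dcat:downloadURL", "https://mvcr1.opendata.cz/czechpoint/data.tar"),
    ("ex:dist2", "dcat:accessURL", "https://example.org/landing"),
]


def values_of(prop: str, data) -> list:
    """(subject, object) pairs for prop, including its direct sub-properties."""
    wanted = {prop} | {sub for sub, sup in SUB_PROPERTY_OF.items() if sup == prop}
    return [(s, o) for s, p, o in data if p in wanted]


# Searching on the broader property finds both distributions.
print(len(values_of("dcat:accessURL", triples)))    # 2
# Searching on the narrower property finds only the direct download.
print(len(values_of("dcat:downloadURL", triples)))  # 1
```

Under such a relationship, a publisher would not need to duplicate the download link in accessURL for it to be found by a search on accessURL, provided the consuming system applies the entailment.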

makxdekkers commented 2 years ago

@kcoyle Just to note that DCAT1 explicitly did not make download URL a subproperty of accessURL in the usage note: "DCAT does not define dcat:downloadURL as a subproperty of dcat:accessURL not to enforce this entailment as DCAT profiles may wish to impose a stronger separation where they only use accessURL for non-download locations".

riccardoAlbertoni commented 2 years ago

Closing as a result of the resolution https://www.w3.org/2022/02/08-dxwgdcat-minutes.html#r02