w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

authenticity and integrity of dcat files and associated datasets #1526

Open npdoty opened 1 year ago

npdoty commented 1 year ago

The spec should address providing integrity and authenticity of dcat files and associated datasets.

As a security matter, it's not clear how authenticity or integrity of metadata files or the associated datasets are assured. A checksum property for the dataset file is available (new in DCAT 3), but there seems a risk of a kind of downgrade attack here: someone tampering with the dataset might at the same time be able to tamper with the metadata and its checksum property.
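The downgrade risk can be made concrete with a short sketch. This is not anything from the spec, just an illustration: if an attacker can rewrite both the dataset and the metadata carrying its checksum, a consumer's integrity check detects nothing.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 digest, as might be published in a checksum property."""
    return hashlib.sha256(data).hexdigest()

original = b"genuine,data\n"
published_checksum = sha256_hex(original)      # value the data owner publishes

# An attacker who controls both artifacts simply recomputes the value:
tampered = b"malicious,data\n"
published_checksum = sha256_hex(tampered)      # metadata tampered in the same step

# The consumer's check still "passes", detecting nothing:
assert sha256_hex(tampered) == published_checksum
```

This is why the checksum only adds security if the metadata itself has some independent guarantee of authenticity.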

Authenticity and integrity might be important security properties to consider; signatures and potentially use of a public key infrastructure might make it possible for a consumer of a dataset to confirm that they know who it came from and that they received it without tampering.

riccardoAlbertoni commented 1 year ago

Thanks for the feedback. We discussed the issue of integrity and authenticity in the DXWG plenary. Let me try to summarize part of the discussion below.

The core of our work is DCAT as a metadata model, and integrity and authenticity seem to relate more to how DCAT is provided than to the DCAT model itself. We are reluctant to address issues not at the core of the group's mandate. We also want to avoid a situation where our DCAT-limited perspective later conflicts with more dedicated solutions from new groups working on cross-cutting technologies that might be chosen to deliver DCAT metadata.

The RDF encoding, one of the most typical ways to serve DCAT, illustrates the above concerns. A DCAT encoding in RDF typically ends up in an RDF store or in a file. In the case of an RDF store, it is the chosen software that needs to ensure integrity and authenticity. In the case of RDF files, other ongoing W3C groups deal with the integrity of RDF content, in particular the RDF Dataset Canonicalization and Hash Working Group [1]. The existence of dedicated efforts shows the timeliness of your comments. If you think it might help, we can point to this ongoing initiative in the DCAT document. However, it seems reasonable to first wait for the RDF Dataset Canonicalization and Hash Working Group's outcomes and, once their work consolidates, check whether we can suggest adopting their recipes. Until then, we can move this issue to the Future work - possible new requirements milestone.

Can you live with this solution until the underlying RDF solutions are delineated further?

[1] https://w3c.github.io/rch-wg-charter/index.html

npdoty commented 1 year ago

I'm not convinced that it's wholly out of scope. One of the only features being added to this version is a checksum property, which is apparently intended to provide security protections, but doesn't provide the expected security protections if there's no way to provide integrity or authenticity of the DCAT metadata.

I'm not sure if the checksum property is fully defined enough that it can be generally interoperably used (is there implementation experience?), but that property assumes that there already exists a canonical way to refer to a distribution, if not a dataset.

If it's not feasible to provide standardized functionality for authenticity and integrity of DCAT files (or other distributions of the metadata) in the short term, then I think it would be reasonable to: 1) add a warning about the security implications of checksum properties when the metadata's authenticity has not been confirmed; and 2) list some ways to access DCAT metadata in an authenticated, secure way (downloaded over HTTPS from the expected origin, for example); and 3) mark it as an issue for a future version.
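Suggestion (2) above is easy to make concrete. A minimal sketch (URL and function name are illustrative, not from the spec): retrieve the metadata only over HTTPS with full certificate and hostname verification, so the checksum values it carries are at least tied to the expected origin.

```python
import ssl
import urllib.request

# A default SSL context verifies the server's certificate chain and hostname,
# tying the retrieved DCAT metadata to the expected origin.
context = ssl.create_default_context()

def fetch_catalog(url: str) -> bytes:
    """Fetch DCAT metadata, refusing insecure transports outright."""
    if not url.startswith("https://"):
        # Without transport security, a network attacker could rewrite both
        # the metadata and any checksum values it carries.
        raise ValueError("refusing to fetch DCAT metadata over an insecure transport")
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

This addresses only transport-level tampering; it says nothing about whether the origin itself published honest metadata.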

Postponing features has to happen sometimes. But I would strongly recommend that there be a plan to address this in the future, rather than just postponing it as a way to avoid dealing with it. Accessing datasets that could be tampered with, or not knowing the provenance or authorship or integrity of a dataset, is a real and significant threat; it affects far more than just the implementers of this spec. I don't think it can be our long-term plan that W3C Recommendations don't provide any mechanism for basic, interoperable security properties and instead rely on the hope that every individual implementation or user will figure out its own way to provide security.

riccardoAlbertoni commented 1 year ago

> If it's not feasible to provide standardized functionality for authenticity and integrity of DCAT files (or other distributions of the metadata) in the short term, then I think it would be reasonable to:
>
> 1. add a warning about the security implications of checksum properties when the metadata's authenticity has not been confirmed; and
> 2. list some ways to access DCAT metadata in an authenticated, secure way (downloaded over HTTPS from the expected origin, for example); and
> 3. mark it as an issue for a future version.
>
> Postponing features has to happen sometimes. But I would strongly recommend that there be a plan to address this in the future, rather than just postponing it as a way to avoid dealing with it.

Thanks, Nick, for your suggestions; we've included them in the Security and Privacy section; check the second paragraph in https://w3c.github.io/dxwg/dcat/#security_and_privacy

Please feel free to suggest improvements to the draft.

If you can live with the current draft, we will backlog this issue for further consideration in the next standardization round of DCAT ( e.g., DCAT 4).

> I'm not convinced that it's wholly out of scope. One of the only features being added to this version is a checksum property, which is apparently intended to provide security protections, but doesn't provide the expected security protections if there's no way to provide integrity or authenticity of the DCAT metadata.

We have acknowledged that in the new paragraph.

> I'm not sure if the checksum property is fully defined enough that it can be generally interoperably used (is there implementation experience?), but that property assumes that there already exists a canonical way to refer to a distribution, if not a dataset.

This solution is adopted by DCAT-AP 2.1.0. The checksum property has the spdx:Checksum class as its range, which specifies the actual spdx:checksumValue and the spdx:algorithm used to produce the checksum. A DCAT distribution might be in many formats other than RDF. As for RDF, there is a Working Group on RDF Dataset Canonicalization and Hash, and we prefer to wait for their outcomes before recommending anything in that direction.
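For readers unfamiliar with the SPDX pattern, a hedged Turtle sketch of what such a distribution description might look like (the distribution URI and hash value are illustrative, and the exact SPDX algorithm individual should be checked against the SPDX vocabulary):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix spdx: <http://spdx.org/rdf/terms#> .

<https://example.org/dist/1> a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/data.csv> ;
    spdx:checksum [
        a spdx:Checksum ;
        spdx:algorithm spdx:checksumAlgorithm_sha256 ;
        # illustrative digest value only
        spdx:checksumValue "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    ] .
```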

> Accessing datasets that could be tampered with, or not knowing the provenance or authorship or integrity of a dataset, is a real and significant threat; it affects far more than just the implementers of this spec. I don't think it can be our long-term plan that W3C Recommendations don't provide any mechanism for basic, interoperable security properties and instead rely on the hope that every individual implementation or user will figure out its own way to provide security.

We agree that this is a pervasive, cross-cutting issue that impacts every vocabulary the W3C recommends, which is the main reason why the solution should be common to all vocabularies. The RDF Dataset Canonicalization and Hash Working Group will likely provide a foundation upon which RDF vocabularies can build. In any case, any further input to consider in the next standardization round is more than welcome.

pchampin commented 1 year ago

@npdoty are you satisfied with what we have added in the spec and @riccardoAlbertoni's response above, and can we close this issue?

bertvannuffelen commented 1 year ago

As an addendum, I would add the effect of persistent URIs (PURIs).

Suppose I have found a dataset in a portal, e.g. data.europa.eu (https://data.europa.eu/data/datasets/https-katalog-riksarkivet-se-store-1-resource-106?locale=en), and it has the PURI https://katalog.riksarkivet.se/store/1/resource/106. Then, by dereferencing the PURI, the original dataset description is found.

So if one does not trust the harvesting portal, one could use this mechanism to find the source portal and the original metadata.

Now, one could argue that one does not trust the response of the HTTP dereferencing either, which in the end comes down to not trusting the source. The use of persistent, dereferenceable identifiers is actually a simple yet powerful method to guarantee that the data is trustworthy.

Note that DCAT is about metadata descriptions: it describes the rules for using the data that it describes. Thus, the issue of trust is actually far more complex than having the "original metadata descriptions". Suppose one uses data found via a DCAT metadata description, delivered through a super-secure data ecosystem, for a data-processing task that infringes legislation; the security of that ecosystem will not be an argument that the processing was permitted. This is because DCAT does not provide data, but the means to access the data. The trust and legal responsibility are therefore transferred to the data provider/data processor (in GDPR terminology) that provides access to the real data.

Also, I want to note that DCAT does not imply sharing the data in RDF format. I hope we agree as a community that DCAT can be implemented in many technical ways, as long as the semantics are preserved. I do agree, though, that RDF is the most natural form and that, for conformance reasons, it should be possible to unambiguously transform the implemented data structure into an RDF structure. (In my opinion, this holds for all domain vocabularies and application profiles.)

P.S. on the checksum: it is about the file the Distribution points to, not about the metadata (the dcat:Distribution) itself.

npdoty commented 12 months ago

It's an improvement to at least have these concerns noted in the spec.

By convention (and to make it parallel with the following section), I would suggest "Security and Privacy Considerations" as a title.

I think "is also not guaranteed" should be "is not also guaranteed".

You might describe addressing these concerns at both the application level and the transport level -- that may be what you mean, but we would note in the Web context that an attacker could tamper with the contents between the server and client if a security-sensitive property like a checksum were delivered over an insecure transport.

This text seems to suggest that the checksum value and algorithm aren't typically sufficient for calculating and comparing checksums and that separately a publisher should provide instructions so that a checksum can be accurately calculated. Have there been interoperable implementations that do calculate and compare these checksums? Or is it just a case-by-case manual review of the documentation and then calculation of a checksum? If the latter, I'm not clear what interoperability we are getting by adding it to the spec.
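The interoperability question above boils down to whether two independent implementations would compute and compare the same value from the same metadata. A minimal sketch of the consumer side, assuming the metadata supplies an algorithm name and a hex digest (the function and its normalization rules are illustrative, not from the spec; algorithm identifiers and casing conventions are exactly the underspecified parts):

```python
import hashlib

def verify_checksum(data: bytes, algorithm: str, expected_hex: str) -> bool:
    """Recompute the digest named in the metadata and compare it to the
    published value. Mapping metadata algorithm identifiers onto local
    hash names is the interoperability-sensitive step this sketch glosses over."""
    algo = algorithm.lower().replace("-", "")       # e.g. "SHA-256" -> "sha256"
    if algo not in hashlib.algorithms_available:
        raise ValueError(f"unsupported checksum algorithm: {algorithm}")
    digest = hashlib.new(algo, data).hexdigest()
    # hex digests compared case-insensitively
    return digest == expected_hex.strip().lower()
```

Two implementations only agree if they also agree on which bytes to hash, which is where publisher-specific instructions currently come in.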

(Apologies for my belated review and follow-up.)

riccardoAlbertoni commented 10 months ago

Thanks @npdoty, PR #1579 is a joint effort to elaborate the section on the basis of your observations; see how it looks at https://w3c.github.io/dxwg/dcat/#security_and_privacy

This new phrasing of the section is quite an improvement. Do you think we can close this GitHub issue?