w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Dcat issue 1526 bis #1578

Closed riccardoAlbertoni closed 1 year ago

riccardoAlbertoni commented 1 year ago

A draft to discuss possible improvements to the privacy and security section.

Preview: https://raw.githack.com/w3c/dxwg/dcat-issue-1526-bis/dcat/index.html#security_and_privacy

Diff: https://services.w3.org/htmldiff?doc1=https://w3c.github.io/dxwg/dcat/&doc2=https://raw.githack.com/w3c/dxwg/dcat-issue-1526-bis/dcat/index.html#security_and_privacy

agreiner commented 1 year ago

The introductory paragraph is all about privacy, but the mitigation with a checksum is only about validating the integrity and authenticity of the data, not privacy. So we seem to introduce one topic and then go on to discuss how we can mitigate something entirely different.

I don't think rights statements could potentially include or reference sensitive information such as user and asset identifiers. That is, I don't think that any user identifier in the context of stating rights over a dataset would or could be a breach of privacy. (One cannot expect privacy when asserting a public right.) The real issue is how well one can secure data about people. But detailing how to secure web content and authenticate users is beyond the scope of DCAT.

The checksum is for the distribution described, not the metadata about it. But since the metadata is often offered in the same download as the distribution, the checksum would not provide authenticity unless provided separately. One must either provide the metadata in a secure manner and separately from the data, or also provide the checksum separately.

typo: "different checksum algorithmS might be deployed"

" It is worth noting that the associated checksum will not provide the expected security protections if ..." The checksum does not provide security, only authenticity/integrity.

DCAT providers should make DCAT distribution files (not just metadata) downloadable from authoritative origins.

I don't think we even need to mention the RDF dataset canonicalization and hash work. The case of a single entity offering distributions really only calls for a hash for each file provided, along with an indication of which hash algorithm was used.

I think Nick's main point is that a checksum should be provided via a route that is separate from the data. It may be included in metadata that is provided with the data, but if so it should also be provided separately to prevent an attacker from manipulating it along with the data. The authenticity of a dataset cannot be assumed if the authenticity of the hash cannot be assured.

Section 6.17: I also think that we should say something about the use of a checksum as an indicator of an update or download error rather than just for verifying integrity.
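To illustrate the point about using a checksum as an update or download-error indicator, here is a minimal Python sketch. It assumes SHA-256 and hex-encoded checksum values; the function name `classify_download` and the three status strings are hypothetical and do not come from DCAT or SPDX.

```python
import hashlib
from typing import Optional


def classify_download(local_bytes: bytes,
                      published_hex: str,
                      previously_seen_hex: Optional[str] = None) -> str:
    """Compare a downloaded file with the published checksum.

    Returns one of three illustrative status strings:
      "mismatch"  - the local copy does not match the published checksum
                    (download error or tampering)
      "updated"   - the local copy matches, but the published checksum
                    differs from the one recorded on an earlier visit
      "unchanged" - the local copy matches and nothing has changed
    """
    # Assumption: the publisher used SHA-256 and hex encoding.
    local_hex = hashlib.sha256(local_bytes).hexdigest()
    if local_hex != published_hex.strip().lower():
        return "mismatch"
    if previously_seen_hex is not None and \
            published_hex.strip().lower() != previously_seen_hex.strip().lower():
        return "updated"
    return "unchanged"
```

A consumer could store the checksum seen on each harvest and pass it as `previously_seen_hex` on the next one; a "mismatch" result would normally trigger a retry of the download before anything else.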

agreiner commented 1 year ago

I took a stab at rewriting this. Unfortunately, I could not for the life of me figure out which branch in GitHub to use, so I will just type it in here.

Security and Privacy Considerations

The DCAT vocabulary supports datasets that may contain personal or private information. In addition, the metadata expressed with DCAT may itself contain personal or private information, such as resource creators, publishers, and other parties or agents described via qualified relations. Implementers who produce, maintain, publish or consume such vocabulary terms must take steps to ensure security and privacy considerations are addressed. Sensitive data and metadata must be stored securely and made available only to authorized parties, in accordance with the legal and functional requirements of the type of data involved. Detailing how to secure web content and authenticate users is beyond the scope of DCAT.

Some datasets require assurances of integrity and authenticity (for example, data about software vulnerabilities). For these, checksums can serve as a type of verification. DCAT borrows the spdx:Checksum class from [[!SPDX]] to ensure the integrity and authenticity of DCAT distributions. Publishers may provide a checksum value (a hash) and the algorithm used to generate the hash for each resource in the distribution. A checksum must, however, be provided via a route that is separate from the data it sums. It may be included in metadata that is provided with the data (e.g., a tarfile that includes a file for the distribution and a file for the metadata that includes a checksum for the distribution file), but if so the checksum, or a checksum for the metadata, must also be provided separately to foil an attacker who would manipulate the checksum along with the data. A checksum provided in DCAT metadata will not provide the expected assurances if the integrity and authenticity of the metadata are not also guaranteed.

Integrity and authenticity of DCAT data ultimately depend on the trustworthiness of the source. DCAT providers should address integrity and authenticity at the application level and transport level. For example, they should ensure the integrity and authenticity of their API and download endpoints, make DCAT data and metadata files downloadable from authoritative HTTPS origins, and provide any checksums via a separate channel from the data they represent.
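The verification step described in the proposed text could look roughly like the following Python sketch. It assumes the consumer has already obtained the expected checksum value and algorithm name through a channel separate from the data download; the function names `file_digest` and `verify_distribution` are hypothetical, and nothing here is prescribed by DCAT or SPDX.

```python
import hashlib
import hmac


def file_digest(path: str, algorithm: str = "sha256",
                chunk_size: int = 65536) -> str:
    """Return the hex digest of the file at `path`, read in chunks."""
    # hashlib.new raises ValueError for an unsupported algorithm name.
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_distribution(path: str, expected_hex: str,
                        algorithm: str = "sha256") -> bool:
    """Check a downloaded distribution file against an expected checksum.

    Assumption: `expected_hex` was obtained separately from the file
    itself (for example, from metadata served by an authoritative
    HTTPS origin), as the proposed text requires.
    """
    actual_hex = file_digest(path, algorithm)
    # Normalise case and whitespace before comparing.
    return hmac.compare_digest(actual_hex, expected_hex.strip().lower())
```

A match only shows that the file corresponds to the checksum that was supplied. As the proposed text notes, that is meaningful only if the checksum itself came from a trustworthy, separate source.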

riccardoAlbertoni commented 1 year ago

Thanks, @agreiner. I have implemented your suggestions in #1579; please check whether I have captured them correctly.

riccardoAlbertoni commented 1 year ago

I am closing this PR as it has been superseded by PR #1579.