Thanks @empavia, `frictionless` is giving us a sha256 hash by default. It's in the `hash` property of the metadata. Is that suitable? Or we could certainly build our own unique identifier. And honestly that might be best, just so that we're not overly dependent on frictionless for basic stuff like this. @phargogh, do you have any opinions?
I think the key question that would be helpful to answer is about how we identify a dataset.
Suppose we have a dataset "A", which is the first version released. It has a cryptographic hash of `123456`. Later on, dataset A has a second edition published, which has a cryptographic hash of `abcdef`. Should these be identified as different datasets, or are they different editions of the same dataset?
If they are different datasets, then the cryptographic hash is perfect: it's already a unique identifier. If they should be identified as the same dataset, just different editions, then it would be helpful to have some sort of uniquely identifying abstraction to refer to the overarching data product.
Yeah, good points, James. We have an `edition` property of the metadata, implying that we do want to keep track of multiple editions of the "same" dataset. But I'm not sure if we have a specific use-case in mind for that yet. Are there cases where we would want to track editions/versions for a dataset and keep all the editions catalogued on CKAN? Or, in that case, would we only ever want to keep the latest edition in the catalog?
> If they should be identified as the same dataset, just different editions, then it would be helpful to have some sort of uniquely identifying abstraction to refer to the overarching data product.
Right, we have the `edition` property, but we did not think about how to identify the overarching product. I think the first step is to describe some real-world use-cases for tracking data product editions.
I could definitely see cases where we'd want to keep track of different versions of a dataset, but I'm not sure about the need to have all of them cataloged on CKAN. @empavia do you happen to have any such cases in mind?
Just thinking out loud here, all of the cases I can think of right now are ones where new versions of a dataset are released as though they are completely independent datasets. A few examples are SRTMv3 (in contrast with v1 and v2) and SoilGrids2017.
It does seem like some degree of versioning is important for reproducibility. Imagine we made an update to an AWC layer. How would we want to identify that the layer has changed? Would we just include it as a new layer in the catalog? Thinking ahead, I'm not sure how a DOI would change if we updated the underlying data with editions, but the behavior is clear if we treat the update as a new dataset.
Anyways, I guess I'm feeling like the simplicity of a SHA might be really nice, but @empavia I would be very interested to hear your thoughts on this.
I agree: I don't foresee new versions of data needing to be new 'editions', and I was thinking of the same use cases you did. New versions of data, or new yearly releases, will likely just be different datasets, since it will be important to provide options (such as having both SRTM v3 and v4, for example).
The issue of reproducibility is important to discuss and think about. I imagined that an updated AWC layer (if we've rerun it with a change in a formula, for example) would replace the original version and might just need updated metadata and a new 'update date' on CKAN and in the metadata. I am not sure how that would work with the SHA or with a DOI, as it seems we may get a new one for a rerun dataset. However, is it possible to just keep the original SHA and DOI and only update the data itself and maybe some other pieces of the metadata that are relevant? For example, for AWC again, we would just completely overwrite the old version with the new version (assuming it is better quality or fixed something, etc.) and maintain the original identifiers? Hopefully that makes sense!!
Curious to hear your thoughts on this, @phargogh. Seems like we are generally on the same page, though.
I think I'm leaning towards having the SHA256 (or another checksum, maybe `<protocol>-<checksum>`, e.g. `sha256-123456abcdef`) be the ID.
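Something like this is what I have in mind for building that kind of ID (just a sketch; the function name and chunked-read pattern are illustrative, not a proposal for where this code should live):

```python
import hashlib


def dataset_id(filepath, algorithm="sha256", chunk_size=1024 * 1024):
    """Build a '<protocol>-<checksum>' style ID, e.g. 'sha256-123456abcdef...'."""
    digest = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        # Stream the file in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"{algorithm}-{digest.hexdigest()}"
```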
As I understand the intent of a DOI, I think the DOI is supposed to uniquely identify that specific layer, and the SHA256 (or whichever checksum we use) should also relatively uniquely identify the dataset. So if we update the AWC equation and produce a new layer, I would think that we should have a new DOI and a new SHA256.
One interesting thing that appears to be true (but not fully implemented in the CKAN UI) is that datasets can apparently be versioned ... but I am not finding anything in the `ckanext-doi` package that indicates that versions are supported for DOIs, and it isn't even clear how one might access an older revision. So, I think our best bet at the moment is to consider each dataset in CKAN a one-off.
Having said that, we could also see about trying to fix these revision issues in CKAN and in `ckanext-doi` in order to have a single dataset that can refer to multiple revisions. DOIs appear to be able to have custom suffixes, so it seems reasonable for us to modify `ckanext-doi` to offer a custom suffix unique to the revision.
> However, is it possible to just keep the original SHA and DOI and only update the data itself and maybe some other pieces of the metadata that are relevant?
Although we could leave the SHA256 unchanged in the metadata, doing so will undoubtedly create confusion down the road: folks will contact us asking why the computed SHA256 of the dataset they downloaded doesn't match the SHA256 stated in the metadata. So for the sake of future us, I think we may want to just stick with having the DOI and the checksum match the dataset. Conceptually, I kind of like the simplicity, too ... change the dataset, treat it like a new dataset.
@phargogh Thanks, James! I am in agreement that we should treat changing the dataset as a new dataset. You are right that we would want a matching SHA256.
Overall, it makes sense, and seems to be standard practice (if I understand your previous note correctly), to generate a new SHA256 and DOI for each new layer. I am interested in us exploring the versioning for DOIs, but it seems that may be a separate task from this. All in all, with the context you've given, I am happy to move forward on using the SHA256 as the ID!
Related to the current approach for generating the sha256, we are using `frictionless.describe(filepath, stats=True)` to do so. And it uses the entire contents of the file to create the hash. That's a problem for large files and remote files.
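For reference, the relevant call looks roughly like this (a sketch: the path is a placeholder, and exactly where the hash lives on the resource depends on the frictionless version we're pinned to, so treat the last line as an assumption):

```python
import frictionless

# Computing stats forces frictionless to read the entire file so it can
# calculate the hash, byte count, etc.
resource = frictionless.describe("path/to/dataset.csv", stats=True)

# The checksum ends up in the resource's stats/hash metadata (assumption:
# v4-style dict access; newer versions expose it differently).
print(resource.stats["hash"])
```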
I'm thinking of updating the unique identifier to be a hash of `filesize + last_modified_time + filepath`. This is similar to the approach taken by `taskgraph` for determining if a file has been modified since the last time `taskgraph` has seen it.
@phargogh Are there any other "fast-hash" approaches we should consider for a unique identifier?
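For concreteness, here's roughly the kind of thing I have in mind (a sketch, not the actual taskgraph implementation; the function name and separator are made up):

```python
import hashlib
import os


def fast_hash(filepath):
    """Hash of filesize + last_modified_time + filepath.

    This identifies the file cheaply (no read of the contents), but it cannot
    be used to verify the file's integrity after a download.
    """
    stat = os.stat(filepath)
    key = f"{stat.st_size}:{stat.st_mtime}:{os.path.abspath(filepath)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```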
Ooo yeah, that's an interesting design problem when we have remote datasets! The main challenge I see is that by changing how the checksum is defined, we're also making it harder to verify the file's integrity. Of course, this is totally fine for a unique ID! But it's problematic for file verification.
At least for the use case of integrity checking, I don't see a way to avoid computing some kind of checksum. For cloud-based files, it would be a pretty straightforward operation to compute the checksum in the cloud environment in order to avoid downloading the whole file to the local computer. For example (from SO), one could execute this on a GCS VM: `gsutil cat gs://<bucket>/<file> | sha256sum`. If the objective is simply to catch any issues when downloading the file, GCS stores a CRC32 which, though not cryptographically secure, is quick to compute and can be requested from GCS. AWS S3 also computes checksums of files uploaded to buckets. Checksums also appear to be created within GDrive and can be accessed through Google Drive's API (I have not verified this).
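For GCS specifically, a sketch of requesting the stored checksums without downloading the object (using the `google-cloud-storage` client; the bucket and object names are placeholders):

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/dataset.tif")  # placeholders

# These checksums are computed and stored server-side, so nothing is downloaded.
print(blob.crc32c)    # base64-encoded CRC32C
print(blob.md5_hash)  # base64-encoded MD5 (not set for some composite objects)
```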
> I'm thinking of updating the unique identifier to be a hash of `filesize + last_modified_time + filepath`. This is similar to the approach taken by `taskgraph` for determining if a file has been modified since the last time `taskgraph` has seen it. @phargogh Are there any other "fast-hash" approaches we should consider for a unique identifier?
To directly answer your question, filesize and `mtime` seem like the key bits of info to hash here! Adding in the filepath sounds like a good additional chunk of identifying data that would make it hard to collide with another hash. CRCs are pretty quick to compute, so that could be used (but I'm pretty sure these do require access to the whole dataset). UUIDs are also reliable identifiers and typically use local cryptography hardware (e.g. `/dev/urandom`) for the truly random parts of the ID, so they should be near-constant-time to generate, but they have nothing to do with the data itself.
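To illustrate the tradeoff (sketch only; the file path is a placeholder): a CRC is cheap per byte but still has to stream the whole file, while a UUID is essentially constant-time but carries no information about the data:

```python
import uuid
import zlib

# CRC32: fast, but still requires reading the entire file.
crc = 0
with open("path/to/dataset.tif", "rb") as f:  # placeholder path
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        crc = zlib.crc32(chunk, crc)
print(f"crc32: {crc:08x}")

# UUID4: near-constant-time to generate, but unrelated to the file contents.
print(f"uuid: {uuid.uuid4()}")
```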
Cool, thanks for your thoughts! I think file integrity questions are beyond the scope of this project for now. And I think the unique identifier should incorporate properties of the data, like size and mtime, so that it remains the same unless the data has changed.
We implemented a hash of `filesize + last_modified_time + filepath`.
The catalog requires a unique code in order to add datasets from metadata. While using the MCF, we implemented this with a UUID. We would like to request that a UUID field be added to the frictionless schema for all data types.
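For example, something like this is what we have in mind (the field name `uuid` is hypothetical; whatever the schema ends up calling it is fine):

```python
import uuid

# A stable, randomly generated identifier assigned once when the metadata is
# first created, alongside the existing content-based hash.
metadata = {
    "uuid": str(uuid.uuid4()),          # hypothetical field name
    "hash": "sha256-123456abcdef",      # example value from earlier in the thread
}
```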