ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

serializeToBagit ignores DataObjects that remotely reference data #122

Open mbjones opened 3 years ago

mbjones commented 3 years ago

A DataObject can include a dataURL that indicates that the bytes of the object are remotely stored on another server, rather than being either in memory or on the local filesystem (which are the other two options). When serializing a DataPackage to disk in BagIt format, the serializeToBagit function skips over any data objects that use the dataURL slot as the reference to data, thus breaking support for this serialization.
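
For reference, a minimal sketch of the situation being described, assuming the DataObject constructor accepts a dataURL argument and that the serializer is exported as serializeToBagIt; names, arguments, and the URL are approximate/illustrative:

library(datapack)

# A DataObject whose bytes live on a remote server: no in-memory data and no
# local file, only a dataURL reference (URL is a placeholder).
remoteObj <- new("DataObject",
                 format = "text/csv",
                 dataURL = "https://example.org/data/observations.csv")

dp <- new("DataPackage")
dp <- addMember(dp, remoteObj)

# This is the call that currently skips remoteObj instead of listing it in the
# bag (e.g. via fetch.txt), which is the behavior reported in this issue.
bagPath <- serializeToBagIt(dp)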

To fix, either:

  1. download the bytes of remote objects and include them directly in the bag's payload, or
  2. list the remote objects in the bag's fetch.txt so that the bytes can be retrieved when the bag is resolved.

The challenge with the second approach is we still need checksums for the remote objects. Technically this should be in the SystemMetadata for the DataObject, but it's likely it was not calculated. If the remote object is a DataONE object, then the SystemMetadata should have the needed checksum.
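
For context, a fetch.txt entry is a whitespace-separated line of URL, length in octets (or "-" if unknown), and the file path inside the bag; an illustrative line:

https://example.org/data/observations.csv  -  data/observations.csv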

Relates to issues #3 and #119

gothub commented 3 years ago

I don't see a way to reliably implement the second option. Currently 'datapack' is creating an MD5 payload manifest, which is required to include all files listed in fetch.txt. An MD5 checksum may not have been calculated and saved in the sysmeta for a remote object, for example SHA256 may have been saved. The remote file would have to be downloaded and the required checksum calculated.
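
For illustration, a rough sketch of that fallback using the digest package; the URL and path are placeholders:

library(digest)

# Fallback for a remote object with no stored MD5: download the bytes and
# compute the checksum needed for the payload manifest.
tmp <- tempfile(fileext = ".csv")
download.file("https://example.org/data/observations.csv", tmp, mode = "wb")
md5 <- digest(tmp, algo = "md5", file = TRUE)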

BTW - should the checksum algorithm be updated to "SHA-256"?

Interestingly, here is the breakdown of DataONE checksum usage, with SHA-256 the most frequent:

https://cn.dataone.org/cn/v2/query/solr/?q=formatType:(DATA%20OR%20METADATA)&facet=true&facet.field=checksumAlgorithm&rows=0
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">112</int>
<lst name="params">
<str name="q">formatType:(DATA OR METADATA)</str>
<str name="facet.field">checksumAlgorithm</str>
<str name="rows">0</str>
<str name="facet">true</str>
</lst>
</lst>
<result name="response" numFound="2330070" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="checksumAlgorithm">
<int name="SHA256">257452</int>
<int name="SHA1">206933</int>
<int name="MD5">158680</int>
<int name="SHA-1">36847</int>
<int name="SHA-256">444</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
<lst name="facet_intervals"/>
<lst name="facet_heatmaps"/>
</lst>
</response>

DataONE has the MNRead.getChecksum() service that will calculate any of the known checksum algorithms for a pid, but using that would put a dependency on the dataone package.
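
For context, roughly what that dependency would look like, assuming the getChecksum() method signature in the 'dataone' package; the node and pid are placeholders:

library(dataone)

# Ask the DataONE member node to report the checksum for a pid in the requested
# algorithm, instead of downloading the object and hashing it locally.
mn <- getMNode(CNode("PROD"), "urn:node:KNB")
md5 <- getChecksum(mn, "urn:uuid:example-pid", checksumAlgorithm = "MD5")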

amoeba commented 3 years ago

This'd be nice to see. A few points:

  1. A bag can have multiple payload manifests, one for each of the checksum algorithms used. So it's valid to have a manifest-md5.txt and manifest-sha256.txt in the bag.
  2. A bit of a hack, but the LoC checksums list has an entry for unk (Unknown). So maybe we could just use that as a third manifest file, manifest-unk.txt, and put some bogus value in for the checksum? The BagIt spec only specifies:

    the checksum algorithm SHOULD be registered in IANA's "Named Information Hash Algorithm Registry"

    so I think this wouldn't make the bag invalid.

  3. The docs for dataUrl say it's for lazy-loading of DataONE DataObjects and don't describe the use case at the top of this issue. Might be good to update that documentation to make it clear.

mbjones commented 3 years ago

@amoeba If we use multiple payload manifests, do we have to provide checksums for all objects in each manifest (e.g., for both MD5 and SHA-256)? Or can some objects be listed in MD5 and others in SHA-256 as long as each object is somewhere?

amoeba commented 3 years ago

Oh right. I misinterpreted what I read. Appears to be the former:

o Every payload manifest MUST list every payload file name exactly

:(

gothub commented 3 years ago

As discussed at the dev meeting yesterday, an approach to resolving the checksum mismatch issue is to set a default checksum algorithm for a DataPackage. DataPackages can be downloaded and created using the following workflows, so the appropriate checksums need to be provided for BagIt serialization for each of these creation/composition methods:

These use cases can be fulfilled with the following changes:

Serializing downloaded packages is a bit more difficult, as a package might be composed of objects that may not all use the same algorithm. Therefore, I suggest that an algorithm be specified (or the default used) when downloading objects and packages. These changes would be made to the appropriate 'dataone' package functions:

do <- getDataObject(d1c, ..., checksumAlgorithm="SHA-256")

or for an entire DataPackage:

dp <- getDataPackage(d1c, ..., checksumAlgorithm="SHA-256")

Updating these methods in the 'dataone' package is necessary for the case where objects are lazy-loaded: the data bytes for an object are not present locally and may be prohibitively large, so they should not have to be downloaded just to calculate the checksum locally.
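
Roughly, the lazy-load scenario this is aimed at would look like the following; checksumAlgorithm is the proposed parameter (not yet part of the released API at the time of this discussion), and the node and identifier are placeholders:

library(dataone)
library(datapack)

d1c <- D1Client("PROD", "urn:node:KNB")

# Lazily download a package: large members are not fetched, only their dataURL,
# but a consistent checksum algorithm would still be recorded for every member.
dp <- getDataPackage(d1c, identifier = "urn:uuid:example-package-pid",
                     lazyLoad = TRUE, limit = "1MB",
                     checksumAlgorithm = "SHA-256")

# With matching checksums available, remote members can be listed in fetch.txt
# while still appearing in a complete payload manifest.
bagPath <- serializeToBagIt(dp)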