usnistgov / oar-pdr

The NIST Open Access to Research (OAR) Public Data Repository (PDR) system software

MIDAS ingest: use POD only for determining files to include #62

Closed RayPlante closed 5 years ago

RayPlante commented 5 years ago

This PR follows on PR #57, which addressed a bug where file metadata was being lost while processing a resource update. The underlying problem, which was not fully addressed there, is that during an update not all of the files will be present in the SIP directory.

A bit of history: when the PDR publishing code (i.e. the metadata server and the preservation service) was first developed, MIDAS was not guaranteed to write out distribution nodes to the output POD file until the user was ready to publish. To allow for this, the PDR publishing code would look at both the POD record and the files present in the SIP directory to determine which files were to be considered part of the resource. Eventually, MIDAS changed to write the distribution records as soon as the user uploaded them. Further, it is possible for the user to delete distributions via MIDAS; doing so, however, does not actually remove the files from disk. For these reasons, it is important that the PDR look only at the POD when determining whether a file is part of a resource.
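The POD-only rule described above can be illustrated with a minimal sketch. This is not the code in this PR: the function names, the `base_path` argument, and the assumption that a file's path within the dataset can be recovered from the tail of its `downloadURL` are all hypothetical and used only for illustration. The only thing taken from the POD schema is that each entry in `distribution` may carry a `downloadURL`.

```python
from urllib.parse import urlparse, unquote


def files_from_pod(pod, base_path):
    """Return the set of dataset-relative file paths listed in a POD record.

    Assumption (hypothetical): every downloadable file lives under
    `base_path` in its downloadURL, and the remainder of the URL path
    is the file's path within the dataset.
    """
    prefix = base_path.rstrip("/") + "/"
    files = set()
    for dist in pod.get("distribution", []):
        url = dist.get("downloadURL")
        if not url:
            continue  # e.g. accessURL-only distributions are not files
        path = unquote(urlparse(url).path)
        if path.startswith(prefix):
            files.add(path[len(prefix):])
    return files


def classify_sip_files(pod, files_on_disk, base_path):
    """Split the resource's files using the POD as the sole authority.

    Returns (present, missing, ignored):
      present - listed in the POD and found in the SIP directory
      missing - listed in the POD but absent from disk (normal in an update)
      ignored - on disk but not in the POD (e.g. deleted via MIDAS)
    """
    listed = files_from_pod(pod, base_path)
    on_disk = set(files_on_disk)
    return (sorted(listed & on_disk),
            sorted(listed - on_disk),
            sorted(on_disk - listed))
```

With this split, a file left on disk after its distribution was deleted lands in `ignored`, while a file listed in the POD but not re-uploaded during an update lands in `missing` and is still treated as part of the resource.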

Updating the PDR code properly required refactoring the module nistoar.pdr.preserv.bagit.builder to be clearer about when metadata is being updated and when it is being replaced. The change in its API required that users of that module be updated as well.

This change is best tested with the demo system within oar-docker, which also required changes. If one tests this PR with oar-docker prior to 10c2139, updating the test dataset fails because a new file (LICENSE.txt) is not listed in the POD file; after 10c2139, the demo update works correctly.

RayPlante commented 5 years ago

Reconfirmed working under oar-docker demo as described above.