ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack
44 stars 9 forks source link

Refactor bagit serialization to match new specification #115

Closed ThomasThelen closed 2 years ago

ThomasThelen commented 4 years ago

Background

Pull request #112 enabled users to add file location information about a data object. It also placed that file in the appropriate directory (somewhere in ./data/) when serializing to bagit. The PR however didn't refactor the rest of the serialization code to match the new format that Metacat uses.

Survey of Changes

DataPakage::serializeToBagit

Write to Bag Files on The Fly

Previously, tag file information (manifest, tagmanifest, etc) were stored in objects and appended as package files were written. I've changed this to write to the files on the fly, as the data is generated. You'll see a few more references to file writing and you'll also see that I removed the objects like

  pidMap <- character() 
  manifest <- character()

Removal of pid-mapping.txt

This file is no longer supported in the current specification so I removed it

Writing System Metadata to The Bag

The current spec wants system metadata in ./metadata/sysmeta so you'll see some changes that reference system metadata.

Renaming Resource Map

The resource map now has the name oai-ore.xml and lives in ./metadata

Writing Science Metadata to Bag

The science metadata is now included in the exported bag. You'll note that I only added support for EML. The problem lies with finding a way to identify which DataObject is the science metadata document. The ideal way (and only way I thought of) is to ask the coordinating node for the format list, and then look for all of the ones that have type METADATA and check the format type. It seemed a little off to have to make a request to the CN every time a bag is created (and took a noticeable amount of time when I tested it) , so I'm open to caching suggestions.

DataPackage::write_to_bag

Note that this method shouldn't be exposed to users so I followed the pattern of not exporting it.

This logic was originally lifted from this bit of code. Since we now write system metadata, the EML, etc, I decided to pull it out into a new method. If wanted, I can put this in a lambda to keep people from accessing it.

The arguments to this method are probably them most confusing part of this PR. The reason for the two flags....

System metadata are processed differently than ordinary data files so there's a flag to dictate that behaviour. The science metadata is like system metadata in the sense that it belongs in the tagmanifest, but also acts like a data object in the sense that it uses the same method, getData, to get the bytes. Duck typing SystemMetadata to follow the DataObject would fix this, but I think the current workaround isn't horribly messy.

Unit Tests

The original serialize test was written in the context of provenance (variable names, prov was added, etc). I removed this code from the test since it had no effect on the actual bag. I also renamed some of the variables to reflect this. I beefed up the testing by checking the contents of the tagmanifest & manifest files to make sure they match what's in the payload & tag directories.

gothub commented 4 years ago

@ThomasThelen you might consider using the RDF relationships in the resource map to determine what the metadata object is for a package. The metadata object will be the subject of the documents relationship, or the object of the isDocumentedBy relationship, for example, with the metadata document with pid doi:10.6085/AA/marine_ltm.9.1:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.9.1">
    <ore:isAggregatedBy rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.1#aggregation</ore:isAggregatedBy>
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doi:10.6085/AA/marine_ltm.9.1</dcterms:identifier>
    <j.0:documents rdf:resource="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.10.1"/>
  </rdf:Description>

and

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.10.1">
    <ore:isAggregatedBy rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.1#aggregation</ore:isAggregatedBy>
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doi:10.6085/AA/marine_ltm.10.1</dcterms:identifier>
    <j.0:isDocumentedBy rdf:resource="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.9.1"/>
  </rdf:Description>

This should work with any DataONE metadata format.

btw - searching for metadata objects with formatId='eml://ecoinformatics' misses EML 2.2.0 objects, as the new formatId for EML 2.2.0 has been updated to 'https://eml.ecoinformatics.org/eml-2.2.0'.

How would you like to proceed with the PR?

ThomasThelen commented 4 years ago

@ThomasThelen you might consider using the RDF relationships in the resource map to determine what the metadata object is for a package. The metadata object will be the subject of the documents relationship, or the object of the isDocumentedBy relationship, for example, with the metadata document with pid doi:10.6085/AA/marine_ltm.9.1:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.9.1">
    <ore:isAggregatedBy rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.1#aggregation</ore:isAggregatedBy>
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doi:10.6085/AA/marine_ltm.9.1</dcterms:identifier>
    <j.0:documents rdf:resource="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.10.1"/>
  </rdf:Description>

and

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.10.1">
    <ore:isAggregatedBy rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://cn.dataone.org/cn/v1/resolve/resourceMap_marine_ltm.9.1#aggregation</ore:isAggregatedBy>
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doi:10.6085/AA/marine_ltm.10.1</dcterms:identifier>
    <j.0:isDocumentedBy rdf:resource="https://cn.dataone.org/cn/v1/resolve/doi:10.6085%2FAA%2Fmarine_ltm.9.1"/>
  </rdf:Description>

This should work with any DataONE metadata format.

btw - searching for metadata objects with formatId='eml://ecoinformatics' misses EML 2.2.0 objects, as the new formatId for EML 2.2.0 has been updated to 'https://eml.ecoinformatics.org/eml-2.2.0'.

How would you like to proceed with the PR?

I noticed that those relationships aren't always present in the resource map. For example, in the unit tests I see resource maps that look like this. I see that addMember adds the relationship only if addMember is supplied the science metadata with the data object. This could be a workflow issue on my part, but I only see a few places in the code that supplies mo.

If the SPARQL query fails I think we're out of solid ways for checking what the science metadata is so I'd continue serving the bag withoutit. Edit: After thinking a little more... If the user never specifies what the science metadata is, then I don't think there's any reason to stress about writing it to the metadata/ folder when serializing to the bag.

ThomasThelen commented 4 years ago

I reworked the previous commit so that any science metadata is written to the metadata folder. @gothub and @amoeba this should be ready for another review.

I'll note that I haven't been able to get rid of the following warning, which needs to be taken care of before merging. librdf warning - Variable o was bound but is unused in the query Which is coming from ` queryString <- 'PREFIX cito: http://purl.org/spar/cito/ SELECT ?s WHERE { ?s cito:documents ?o . }'

I don't want to use ?o, but need to have a placeholder for it for the query to make sense. IF you have any suggestions I'm open.

amoeba commented 4 years ago

Re:

librdf warning - Variable o was bound but is unused in the query

AFAICT this is not a SPARQL requirement and may just be a spurious thing rdflib does here. I would probably just bind the variable and handle it. Alternatively, if you think it's appropriate to apply here, R has a facility called suppressWarnings() which can be useful (sparingly) in situations like this.

ThomasThelen commented 4 years ago

Re:

librdf warning - Variable o was bound but is unused in the query

AFAICT this is not a SPARQL requirement and may just be a spurious thing rdflib does here. I would probably just bind the variable and handle it. Alternatively, if you think it's appropriate to apply here, R has a facility called suppressWarnings() which can be useful (sparingly) in situations like this.

Fixed in 8a6391a. I originally tried selecting ?o however post-processing became ugly with the XPATH that selects relevant data. In the commit referenced, I just check that ?o is also documented by ?s, which should be true.

ThomasThelen commented 4 years ago

Should tidy up docs for relativeFilePath

You might be looking at an old commit (relativeFilePath was changed to a different name). The slot and param docs are using the same language used in that area of the source ("a character string", etc). I can definitely change it but it might look out of place.

I did update some of the documentation in 2d14c63, mostly overhauling the export section to make the usage of targetPath more explicit.

Indicate relativeFilePath is optional in docs

Fixed in 90add9b

relativePath is taken in as an arg with no validation. Do we want to sanitize it? i.e., what happens when I provide a values like C:\Foo, /Users/me/bar, or ../../baz?

Fixed in 9152a4c and 8223a499203e82a2c9a68cfd6c8b0dc88597dbd9

I couldn't find a library that handles Path validation so I went with a blacklist approach...

this is bit unclear and has a typo. Phrases like "unnecessary folders" don't readily make sense to me, though I think I know what you mean. Re-working this along with the documentation for the relativeFilePath arg might make this more clear.

When I refactored the vignette I removed this section and kept the targetPath information contained to the Exporting section. I gave a few examples of what good and bad paths look like which should make it more clear to a reader.

New function names like write_to_bag are snake case but existing functions and methods are camel-cased. I think it'd be best to stay consistent within the package.

Fixed! Must have been writing some Python that day ;)