ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

support nested objects in DataPackages #109

Closed mbjones closed 2 years ago

mbjones commented 5 years ago

This is an enhancement request to support file hierarchies in the R datapack implementation, compatible with other implementations of the DataONE data packaging specification, and to bring it into conformance with the RDA recommendation, as described in https://github.com/DataONEorg/api-documentation/issues/5

Currently, a DataPackage consists of a flat collection of DataObject instances. A proposal has been made to include hierarchies of objects within folders so that DataPackages can more easily mirror file system hierarchies. Within DataPackage, this requires that we record the location of objects within the nested folder hierarchy, relative to the root of the data package. Here is an example folder hierarchy that a researcher might want to encode in their data package:

package
├── inputs
│   ├── input1.csv
│   └── input2.csv
├── outputs
│   └── output1.csv
├── table1.csv
└── table2.csv

This hierarchy would be encoded by recording the relative path of each object in the hierarchy in a prov:atLocation property in the OAI-ORE ResourceMap for the package. Thus, we need to add methods for adding and removing these prov:atLocation properties from the ResourceMap. They would be serialized in the resource map as follows:

<prov:atLocation>outputs/output1.csv</prov:atLocation>

It's important to note that the path is relative to the data package root, and includes neither a file:// prefix nor the data directory that is part of the bag.
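As a rough illustration of that constraint, here is a minimal base R sketch of a check for package-relative paths. The helper name isPackageRelative is made up for this example and is not part of datapack. It only rejects URI schemes, drive letters, and leading slashes; it cannot tell whether a leading data/ is the bag's data directory or a real folder in the package, so that case is left unchecked.

```r
# Hypothetical helper (not a datapack function): TRUE when `path` looks like
# a package-relative location, i.e. it has no URI scheme such as file://,
# no Windows drive letter, and no leading slash or backslash.
isPackageRelative <- function(path) {
  hasScheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", path)
  hasDrive  <- grepl("^[A-Za-z]:", path)
  isRooted  <- startsWith(path, "/") | startsWith(path, "\\")
  !(hasScheme | hasDrive | isRooted)
}

candidates <- c("outputs/output1.csv",
                "file:///tmp/bag/data/outputs/output1.csv",
                "/tmp/outputs/output1.csv",
                "C:/work/outputs/output1.csv")
result <- isPackageRelative(candidates)
print(result)
```

Only the first candidate passes; the other three carry a scheme, a leading slash, or a drive letter.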

In addition, we need to serialize and de-serialize these object hierarchies when importing and exporting serialized versions of packages, currently in BagIt format. A BagIt bag that contains a DataPackage would serialize the objects of the package within the data directory of the bag, and would treat the data directory as the root of the file hierarchy for the package. The package hierarchy shown above would be placed within a BagIt bag as follows:

bag/
├── bagit.txt
├── data
│   ├── inputs
│   │   ├── input1.csv
│   │   └── input2.csv
│   ├── outputs
│   │   └── output1.csv
│   ├── table1.csv
│   └── table2.csv
└── manifest-sha256.txt
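The mapping between package-relative paths and bag paths can be sketched in base R as below. This is only an illustration under stated assumptions: toBagPath and fromBagPath are hypothetical helper names, the files are empty placeholders, and the result is not a complete bag because bagit.txt and the manifest are not written.

```r
# Package-relative paths, as they would appear in prov:atLocation.
pkgPaths <- c("inputs/input1.csv", "inputs/input2.csv",
              "outputs/output1.csv", "table1.csv", "table2.csv")

# Hypothetical helpers: the bag's data directory is the package root.
toBagPath   <- function(pkgPath) file.path("data", pkgPath)
fromBagPath <- function(bagPath) sub("^data/", "", bagPath)

# Lay out the payload under a fresh temporary directory.
bagDir <- tempfile("bag")
for (p in toBagPath(pkgPaths)) {
  target <- file.path(bagDir, p)
  dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
  file.create(target)  # empty placeholder; a real export writes the object's bytes
}

# Paths found on disk, relative to the bag directory.
bagFiles <- list.files(bagDir, recursive = TRUE)
print(bagFiles)

# Reading a bag back in reverses the mapping.
recovered <- fromBagPath(bagFiles)
```

Stripping the leading data/ on import recovers exactly the paths that would be recorded in the resource map.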

See the proposal on modifying the DataONE package specification for more details and discussion.

gothub commented 5 years ago

I see two use cases that would need to be supported: (1) building a package from scratch and (2) downloading a package from DataONE to create a DataPackage in R. Use case 1 may involve updates to addMember, getMember, and insertRelationship.

For use case 2, the rdataone package methods uploadDataPackage and getDataPackage would need updating.

Are there other use cases that we would support?

ThomasThelen commented 4 years ago

Setting the file path prior to upload

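One way to do this in datapack is to create a user-friendly method that wraps the following:

 # The subject is the object's identifier; the relative path is stored as a literal
 dp <- insertRelationship(dp,
     subjectID = someObjectID,
     objectIDs = filePath,
     predicate = "http://www.w3.org/ns/prov#atLocation",
     objectTypes = "literal")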

Call the method setFilePath or something similar. Tools that use this package (such as rdataone) can then call this method in some preferred way.
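A minimal sketch of what such a wrapper could look like, assuming the datapack package is installed. The function setFilePath is hypothetical (it is the name floated above, not an existing datapack function), and the identifier is a made-up example. It relies on insertRelationship taking the subject, objects, and predicate as named arguments, and on getRelationships returning a data frame with predicate and object columns.

```r
library(datapack)

# Hypothetical wrapper (not an existing datapack function): record the
# package-relative path of an object as a prov:atLocation literal.
setFilePath <- function(dp, objectId, path) {
  insertRelationship(
    dp,
    subjectID   = objectId,
    objectIDs   = path,
    predicate   = "http://www.w3.org/ns/prov#atLocation",
    objectTypes = "literal"
  )
}

dp <- new("DataPackage")
# Made-up identifier, used only for illustration.
dp <- setFilePath(dp, "urn:uuid:example-object-id", "outputs/output1.csv")

# Inspect the relationships now stored in the package.
rels <- getRelationships(dp)
print(rels[, c("subject", "predicate", "object")])
```

Keeping the wrapper this thin means callers such as rdataone could still decide when and how to call it.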

If this library is meant to be fairly general, any tooling that uses it can easily create its own method that does the work. For example, we could instead just modify rdataone.addMember to take a file path, which would in turn call datapack.DataPackage.insertRelationship.

At the end of the day, datapack.insertRelationship is going to have to be called somewhere.

@mbjones or @gothub, any opinions on whether I should add a method to the datapack DataPackage class, addFilePath(objectId, path), or instead modify addMember in rdataone to take an optional file path and call datapack.DataPackage.insertRelationship from there?

mbjones commented 4 years ago

@ThomasThelen I think that modifying a DataPackage is in the scope of datapack, and if you want a utility method for setting prov:atLocation, then it would be good to place it in datapack so that anyone creating a package can call it. datapack::describeWorkflow is a similar utility method.

ThomasThelen commented 4 years ago

@mbjones I'm currently writing the R implementation and I'd like to outline a few workflows to make sure what I'm doing makes sense and to gather feedback on how this is exposed.

Use Case: User adds a DataObject to a DataPackage

This is one of the most basic workflows. Consider a user who wants to get a BagIt representation of some files on their disk. They decide to use datapack to:

  1. Create a DataPackage
  2. Export it as a bag to disk

The code to do this without any sort of file path adding would look something like...

library(datapack)
dp <- new("DataPackage")
# Add the script to the DataPackage
progFile <- system.file("./extdata/pkg-example/logit-regression-example.R", package="datapack")
progObj <- new("DataObject", format="application/R", filename=progFile)
dp <- addMember(dp, progObj)
serializeToBagIt(dp)

The goal is to introduce a non-invasive way of letting the user describe the path of logit-regression-example.R. In this case, the progFile variable has a nicely formed path: it doesn't include a drive letter, the user's home directory, or other irrelevant path components (i.e. not Desktop/cat-memes/extdata/pkg-example/....).

Using the path from system.file

It would be great to just grab the path in progFile and add it as the prov:atLocation field. However, paths are not always this nicely formed, and it should be assumed that people will use full paths that contain drive letters, home folders, etc. Problems also arise when thinking about how to automatically parse the path.

We can certainly automatically clean C:/tthelen/top-secret-experiments/github/datapack/myData.csv to tthelen/top-secret-experiments/github/datapack/myData.csv; however, there's no way to tell that the desired path is datapack/myData.csv.
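To make the limitation concrete, here is a small base R sketch. Both helper names (stripAbsolutePrefix and relativeToRoot) are hypothetical and exist only for this example. The first does the automatic cleaning described above; the second produces the desired path, but only because the caller supplies the root explicitly.

```r
# Hypothetical helper: drop a Windows drive letter and any leading slashes.
stripAbsolutePrefix <- function(path) {
  path <- sub("^[A-Za-z]:", "", path)
  sub("^/+", "", path)
}

# Hypothetical helper: express `path` relative to a root chosen by the user.
relativeToRoot <- function(path, root) {
  root <- sub("/+$", "", root)  # tolerate a trailing slash on the root
  prefix <- paste0(root, "/")
  if (!startsWith(path, prefix)) {
    stop("path is not inside root: ", path)
  }
  substring(path, nchar(prefix) + 1)
}

fullPath <- "C:/tthelen/top-secret-experiments/github/datapack/myData.csv"

# Automatic cleaning keeps every intermediate folder.
cleaned <- stripAbsolutePrefix(fullPath)

# Only an explicit root yields the path the user actually wants.
relative <- relativeToRoot(fullPath, "C:/tthelen/top-secret-experiments/github")

print(cleaned)
print(relative)
```

The root passed to relativeToRoot is information that cannot be derived from the path string alone, which is why the user has to provide it in some form.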

Adding an Additional Field to the DataObject Constructor

Another approach that is most likely less error-prone is asking the user to supply the relative path when creating the DataObject.

new("DataObject", format="application/R", filename=progFile, filepath="./extdata/pkg-example/logit-regression-example.R")

Use Case: rdataone User Creates a Package

This use case follows the same general pattern as the first one. In both cases the user will be interacting with datapack to create a DataObject and then add it to a DataPackage.

ThomasThelen commented 4 years ago

I've created a draft pull request, linked above, with the changes. @mbjones, @gothub, @jeanetteclark, if you could take a look at that and let me know if there are any additional changes that need to be made, I'd appreciate it!

ThomasThelen commented 4 years ago

Ready for review.