ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

Create a new DataObject that references an existing object in DataONE #119

Closed gothub closed 3 years ago

gothub commented 4 years ago

It should be possible to easily create a DataObject for an object that has already been uploaded to DataONE. Different DataONE packages can already include the same object, but there is no mechanism yet to create a DataObject for an existing DataONE object, or a DataPackage that contains a mix of new local objects and existing DataONE objects.

This issue has been raised in other contexts, but the root of the solution should probably begin with an update to DataObject. Potential solutions:

Other issues that are related to this:

cboettig commented 4 years ago

Thanks @gothub, this would be awesome. In general, I'd love to see a native constructor for DataObject. All the documentation shows the use of new("DataObject", ...), and I believe that's not considered best practice (it also makes it harder to import the method by namespace). A constructor function would also 'hide' the S4 internals from the user. (Likewise, it would be good if there were accessor methods so users don't need @, but I think you largely do that already!)

I'd love to have the function datapack::DataObject(...) as the preferred mechanism. Then the syntax for constructing new objects from local files would be unchanged.

Good question about what the syntax for connecting to an existing DataObject should be. The idea of confirming if the object exists by checksum would be brilliant, but I think not practical (DataONE simply doesn't enforce unique checksums, only unique ids). Other options:

gothub commented 3 years ago

@cboettig @mbjones

For creating DataObjects that refer to a DataONE pid, I propose:

do <- new("DataObject", id="urn:uuid:4dc4a896-31c2-4185-b1d7-ebb37f3f9cd6", reference=TRUE);

The additional parameter reference=TRUE is needed because it is an error to create a DataObject without specifying data to include in it. This param makes it clear that this empty object is a reference. This also sets a slot for this object so that operations on this DataObject can be done intelligently downstream, for example when a DataPackage that contains this object is uploaded to DataONE. If you can recommend a better name for this param, please do.

The 'dataone' package is being updated to indicate which package members are currently in DataONE, for example:

 dp
Members:

filename                    format                      mediaType  size     identifier                                    modified local in DataONE 
strix-pacific-northwest.xml eml://ecoinf...rg/eml-2.1.1 NA         22840    urn:uuid:206ece99-1a14-48a5-9def-4947c999e7ad y        y     n     
Strix-occidentalis-obs.csv  text/csv                    NA         1227     urn:uuid:279ca903-5942-4f9e-9c2b-19acc76f2b3d n        y     y     
OwlNightj.csv               text/csv                    NA         165963   urn:uuid:4dc4a896-31c2-4185-b1d7-ebb37f3f9cd6 n        n     y     
filterObs.R                 application/R               tex...rsrc 554      urn:uuid:75be9163-f05b-42a5-91c8-a95eb9a6fbec n        n     y     

mbjones commented 3 years ago

@gothub @cboettig This use case seems almost identical to the use case that we already support with dataURL as the input parameter for the constructor. From the code documentation:

dataURL A character string containing a URL to remote data (a repository) that this DataObject represents.

This table from the code docs shows where the object expects its data to live based on which parameters are provided to the constructor:

dataUrl  filename  dataobj  comment
Y        N         N        used for lazy loaded DataObjects, 'dataUrl' is the data source
N        Y         Y        'dataobj' is the data source, 'filename' is sysmeta.filename (download filename)
N        Y         N        'filename' is the data source, 'filename' is sysmeta.filename
N        N         Y        Invalid, if 'dataobj' is specified, 'filename' must also be specified.
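
To make the table above concrete, here is a minimal base-R sketch of the same decision logic. The helper name data_source and its use of NULL for a missing 'dataobj' are assumptions made for illustration; this is not datapack's actual initializer code, only a restatement of the four documented rows.

```r
# Hypothetical helper, not part of datapack: reports which argument would
# act as the data source, following the four rows of the table above.
data_source <- function(dataURL = NA_character_,
                        filename = NA_character_,
                        dataobj = NULL) {
  has_url  <- !is.na(dataURL)
  has_file <- !is.na(filename)
  has_obj  <- !is.null(dataobj)

  if (has_url && !has_file && !has_obj) {
    "dataURL"    # lazy loaded object, the data stays remote
  } else if (!has_url && has_file && has_obj) {
    "dataobj"    # in-memory data, 'filename' is only the download name
  } else if (!has_url && has_file && !has_obj) {
    "filename"   # data is read from the local file
  } else if (!has_url && !has_file && has_obj) {
    stop("if 'dataobj' is specified, 'filename' must also be specified")
  } else {
    stop("combination not covered by the table")
  }
}

# Example calls (the URL and file name are placeholders)
data_source(dataURL = "https://example.org/object/abc")
data_source(filename = "obs.csv", dataobj = charToRaw("a,b"))
```

Combinations that the table does not list are rejected here rather than guessed at, since the documentation quoted above does not say how they are handled.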

Can you elaborate on what we need that differs from what is already provided by lazy loading data from a dataURL? One thing I can see is I think we are lacking good user-facing documentation of the use of this feature.

gothub commented 3 years ago

@mbjones @cboettig yes, good point. I considered using id with the setting from the first row in the table above, however, I see these drawbacks:

With 'id' plus 'reference' the user has explicitly stated that the id is for an existing DataONE object, so no assumptions have to be made regarding what the user wishes to do.

If you think that the extra param is unneeded or confusing, the 'id' plus first row of params could certainly be made to work, and documented to clear potential confusion.

gothub commented 3 years ago

@cboettig @mbjones btw, another method for creating a DataObject that references a DataONE pid is to lazy load the DataObject

outputObj <- getDataObject(d1c, id="urn:uuid:279ca903-5942-4f9e-9c2b-19acc76f2b3d", lazyLoad=TRUE, limit="1GB", quiet=FALSE)
dp <- addMember(dp, outputObj, metadataObj)

This creates a DataObject that contains system metadata and has the 'dataUrl' defined. This method has the benefit of ensuring that the user has access to the data they are including in a newly composed package. The downside, of course, is that the sysmeta has to be downloaded.

mbjones commented 3 years ago

Yeah, that is very close to what we need, @gothub, and would probably work for @cboettig. If we could omit the limit and quiet params, it would be quite easy to include an existing object in a new package based solely on its existing identifier.

gothub commented 3 years ago

The default for limit is "1GB". This may have been chosen to satisfy a particular use case. In hindsight, a better option may have been no limit if the param is not included. Should this be changed? The quiet param is optional.

gothub commented 3 years ago

The 'lazyLoad' behavior was changed in the dataone package and explained in issue https://github.com/DataONEorg/rdataone/issues/258.

No new parameters were added to DataObject initialization, and the workflow to create a DataObject that references an existing DataONE object is the one described above in the Aug 31 comments.

cboettig commented 3 years ago

Thanks @gothub , this is fantastic! Just to be clear, I should now do:

outputObj <- getDataObject(d1c,
                           id="urn:uuid:279ca903-5942-4f9e-9c2b-19acc76f2b3d",
                           lazyLoad=TRUE)

to access an existing object (without downloading it)

gothub commented 3 years ago

@cboettig Yes, that is the correct call.
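
For readers arriving at this thread later, the pieces discussed above might be combined roughly as follows to build a package that mixes a new local object with an existing DataONE object. This is only a sketch: the function name compose_mixed_package and its arguments are hypothetical, it assumes library(dataone) and library(datapack) have been attached before it is called, and the environment and member node in the usage comment are placeholders.

```r
# Sketch only: a hypothetical wrapper around the calls shown in this thread.
# Attach the packages first: library(dataone); library(datapack)
compose_mixed_package <- function(d1c, existing_pid, metadata_file, metadata_format) {
  # New local object: metadata read from a file on disk.
  metadataObj <- new("DataObject", format = metadata_format,
                     filename = metadata_file)

  # Existing DataONE object: with lazyLoad = TRUE the system metadata is
  # fetched, but the data bytes themselves are not downloaded.
  existingObj <- getDataObject(d1c, id = existing_pid, lazyLoad = TRUE)

  dp <- new("DataPackage")
  dp <- addMember(dp, metadataObj)
  # The third argument records that metadataObj documents existingObj.
  dp <- addMember(dp, existingObj, metadataObj)
  dp
}

# Usage (needs network access and a reachable member node, so not run here):
# d1c <- D1Client("PROD", "urn:node:KNB")   # assumed environment and node
# dp  <- compose_mixed_package(d1c, existing_pid, metadata_file, metadata_format)
```

Defining the function does not contact DataONE; the network request only happens when it is called with a real D1Client.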

For a summary of this argument behavior for the dataone 2.2.0 release, please skip to the very bottom of this issue