mbjones / opensci_r_esa_2013

Conducting Open Science using R and DataONE: A Hands-on Primer

Uploading EML to the KNB? #4

Open cboettig opened 10 years ago

cboettig commented 10 years ago

@mbjones Revisiting this -- now that I have authentication working I thought I'd go ahead and add KNB as a publish option for reml (see https://github.com/ropensci/reml/issues/20, https://github.com/ropensci/reml/issues/23).

It looks like the example script (https://github.com/mbjones/opensci_r_esa_2013/blob/master/dataone-r/dataone-write-data.R) just uploads the CSV file directly. What do I need to do to upload an EML file? Is it just changing:

csv_object <- new(Class="D1Object", id, "csvdata.csv", "text/csv", mn_nodeid)

to

eml_file <- new(Class="D1Object", id2, "myeml.xml", "text/xml", mn_nodeid)

Or do I need a different Class, etc?

Probably best if you could provide an example uploading an EML file that I can then wrap into a publish_knb function? (As we discussed in https://github.com/ropensci/reml/issues/23, this would just be a helper routine for the publish_eml function).

mbjones commented 10 years ago

@cboettig Your incantation should work, but rather than use the generic text/xml type, it would be better to specifically type it with the version of EML that you are using, such as:

eml_file <- new(Class="D1Object", id2, "myeml.xml", "eml://ecoinformatics.org/eml-2.1.1", mn_nodeid)

You can get a list of the controlled format IDs from the DataONE formats service: https://cn.dataone.org/cn/v1/formats

To associate the CSV and EML files, you create a DataPackage, and add both the EML files and CSV files to the package, and indicate that the EML file documents the CSV files by setting a relationship property. As there is only one such property, the R client provides a mechanism for this using the insertRelationship() method. For example:

# Create the data object and make it public
d1Object <- new(Class="D1Object", id, csvdata, format, mn_nodeid)
setPublicAccess(d1Object)

# Create a metadata object and make it public as well
metadata <- paste(readLines("test.xml"), collapse = '')
format.mta <- "eml://ecoinformatics.org/eml-2.1.1"
d1o.md1 <- new("D1Object", id.mta, metadata, format.mta, mn_nodeid)
setPublicAccess(d1o.md1)

# Assemble our data package containing both metadata and data
data.package <- new("DataPackage", packageId=id.pkg)
addData(data.package,d1Object)
addData(data.package,d1o.md1)
insertRelationship(data.package, id.mta, c(id))

# Now upload the whole package to the member node
create(d1.client, data.package)

The DataPackage is represented by an OAI-ORE ResourceMap, which is an RDF document listing the contents to be aggregated together and some properties about those contents. Here's an example ResourceMap that gets created for reference purposes: https://knb.ecoinformatics.org/knb/d1/mn/v1/object/resourceMap_6000141086_2.3.4
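
To make the relationship concrete, here is a minimal base R sketch of the statements a resource map carries for one metadata object and its data objects. It assumes the CITO vocabulary (documents / isDocumentedBy) that DataONE packages use for this purpose. The helper name relationship_triples is hypothetical and not part of the R client, and the function only builds a table for inspection rather than a real ResourceMap.

```r
# Hypothetical helper (not a dataone function): tabulate the
# documents / isDocumentedBy statements linking one metadata object
# to the data objects it describes.
relationship_triples <- function(metadata_id, data_ids) {
  cito <- "http://purl.org/spar/cito/"  # assumed vocabulary namespace
  rbind(
    # metadata -> documents -> each data object
    data.frame(subject = metadata_id,
               predicate = paste0(cito, "documents"),
               object = data_ids,
               stringsAsFactors = FALSE),
    # each data object -> isDocumentedBy -> metadata
    data.frame(subject = data_ids,
               predicate = paste0(cito, "isDocumentedBy"),
               object = metadata_id,
               stringsAsFactors = FALSE)
  )
}

# Placeholder identifiers, for illustration only
triples <- relationship_triples("example-metadata-id",
                                c("example-data-id-1", "example-data-id-2"))
print(triples)
```

Each data object contributes two rows, one in each direction, which is why a single insertRelationship() call with a vector of data ids is enough to describe the whole package.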

When the EML is uploaded, most of it is indexed for search and discovery. What else do you mean in terms of 'extracted'?

cboettig commented 10 years ago

@mbjones Perfect, thanks!

Right, I meant indexed. Very cool that the indexing of the EML happens automatically -- I assume the format.mta information triggers this?

Can you clarify which fields are indexed, and how? That is, which fields just become part of a full-text search and which can be searched by specific type (such as the "spatial/temporal/taxonomic context" nodes)?

One more question -- can you clarify the use of the ids in the upload? For instance, I see you generate a random id for the simple CSV file example. It appears that the KNB client returns a different identifier when it registers the data? I also understand that I can get a DOI from the KNB/DataONE, but I'm not sure how to do that. Also, for the EML data-package example, it seems these identifiers should correspond to ids in the EML, but I'm not quite sure how.

cboettig commented 10 years ago

@mbjones By the way, complete side note here: that controlled formats list you linked to is kinda inspiring! Um, I do notice a few formats that might be nice to have, and I'm wondering what the process for adding formats is. (In particular, I see application/x-python and application/mathematica but nothing obviously for R scripts or R packages. Likewise, I see the nexus/1997 format for phylogenies, but not the richer nexml format.) Or is it undesirable to proliferate these classes too much?

mbjones commented 10 years ago

Yes, if the format is set to one of the DataONE recognized METADATA types, then it will index it. The formats service I listed previously indicates which formats are METADATA and which are DATA from DataONE's perspective. The query service at https://cn.dataone.org/cn/v1/query/solr lists the fields that DataONE indexes. When indexing content, DataONE maps fields from different metadata standards (e.g., EML, FGDC, ISO 19139) onto that common Solr schema, so there is a bit of interpretation in the mapping. This process is described in more detail here: http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html
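
As a rough sketch of how a fielded search against that endpoint might be composed from R, the base R helper below only builds the request URL and does not send it. The helper name build_solr_query and the field names title and id are assumptions for illustration, so check them against the field list the service returns. Appending "/?" plus standard Solr q and fl parameters to the endpoint is also an assumption about how the service accepts queries.

```r
# Hypothetical helper: compose a Solr-style query URL for the DataONE
# query service. Nothing is sent over the network here.
build_solr_query <- function(field, value,
                             fields = c("id", "title"),
                             endpoint = "https://cn.dataone.org/cn/v1/query/solr") {
  # q = what to match, fl = which indexed fields to return
  params <- c(q = paste0(field, ":", value),
              fl = paste(fields, collapse = ","))
  # Percent-encode each parameter value, including reserved characters
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(endpoint, "/?", paste0(names(params), "=", encoded, collapse = "&"))
}

# Search one specific field rather than the full text
print(build_solr_query("title", "mooring"))
```

Values containing spaces or Solr special characters (such as the colons and slashes in an EML format ID) would need quoting or escaping on top of the URL encoding shown here.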

The client should return the same identifier you passed in on create() -- it shouldn't be different. The ids are set by the client, with the only significant rule being that they need to be globally unique (minus some details about character escaping, length, and non-printing characters). To help with that, Member Nodes are able to provide some identifier generation services which will generate and/or reserve an identifier for a particular client. The KNB node implements a service to generate UUIDs and one to generate DOIs. Unfortunately, I haven't exposed this functionality in the R Client yet -- needs to be added. Here's how you would get an ID in curl, assuming you'd already logged into CILogon to generate your client certificate:

curl -E /tmp/x509up_u501 -F scheme=UUID https://knb.ecoinformatics.org/knb/d1/mn/v1/generate

If you switch scheme=UUID to scheme=DOI, you'll generate a DOI, but please don't use this for testing -- we don't want to issue a bunch of bogus DOIs. We have a separate test service if you want to set up for testing. If the DOI is used in a create() call on an EML document, then Metacat will automatically register the metadata with that DOI on EZID and make sure the DOI resolves to the right location.
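
Since the generate service is not exposed in the R client, one client-side alternative for test uploads is to mint a random UUID-style identifier locally. The sketch below uses only base R. The helper name make_uuid_urn is hypothetical, and because it draws on R's ordinary random number generator, uniqueness is only probabilistic: unlike the generate service, nothing is reserved on the Member Node.

```r
# Hypothetical helper (not part of the dataone package): build a random
# version-4-style UUID and return it as a urn:uuid: identifier.
# Caution: calling set.seed() with a fixed value beforehand would
# reproduce the same "random" identifier on every run.
make_uuid_urn <- function() {
  hex <- c(0:9, letters[1:6])                 # "0".."9", "a".."f"
  h <- sample(hex, 32, replace = TRUE)        # 32 random hex digits
  h[13] <- "4"                                # version nibble (random UUID)
  h[17] <- sample(c("8", "9", "a", "b"), 1)   # variant nibble
  paste0("urn:uuid:",
         paste(h[1:8], collapse = ""), "-",
         paste(h[9:12], collapse = ""), "-",
         paste(h[13:16], collapse = ""), "-",
         paste(h[17:20], collapse = ""), "-",
         paste(h[21:32], collapse = ""))
}

id <- make_uuid_urn()
print(id)
```

The resulting string can be passed as the id when constructing a D1Object, in the same position as the ids in the examples above.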

Once you've decided on your IDs, these should go in the EML document as well. The overall identifier for the EML document goes in the packageId attribute on the root eml element. The individual IDs for the data files should go in the entity section for that data object, typically under the id attribute for the entity, and/or under the online/url element, in which case you should use a URI format for the identifier. For example:

<eml:eml 
  xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" 
  xmlns:stmml="http://www.xml-cml.org/schema/stmml" 
  xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://sbc.lternet.edu/external/EML/schemas/eml-2.1.1/eml.xsd" 
  system="knb" 
  scope="system"
  packageId="urn:uuid:d4c22d88-bf30-47dc-a866-fb6cf47ec636">
  <access authSystem="knb" order="denyFirst" scope="document">
    <allow>
      <principal>public</principal>
      <permission>read</permission>
    </allow>
  </access>
  <dataset scope="document">
    <shortName>Test Dataset</shortName>
    <title>TESTDATA: Test data set for uploading content, should be ignored</title>
    ...
    <dataTable id="urn:uuid:706a990d-71f0-4b37-a9d9-72b6051771bd" scope="system">
      <entityName>mohawk_mooring_mko.txt</entityName>
      <entityDescription>Moored data from Mohawk Outside Spar (station = mko) CTD, ADCP VSF, 3
        thermistors (2005-01-01 - 2008-01-15)</entityDescription>
      <physical scope="document">
        <objectName>mohawk_mooring_mko.txt</objectName>
        <size unit="bytes">67632386</size>
        ...
        <distribution>
          <online>
            <url function="download">https://knb.ecoinformatics.org/knb/d1/mn/v1/object/urn:uuid:706a990d-71f0-4b37-a9d9-72b6051771bd</url>
          </online>
        </distribution>
      </physical>
      ...
    </dataTable>
  </dataset>
  ...
</eml:eml>
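
If you are assembling the EML programmatically, the download URL for each entity can be derived from its identifier. This is a minimal base R sketch that follows the URL pattern in the example above. The helper name data_object_url is hypothetical, and the optional percent-encoding is an assumption for identifiers containing characters that are not safe in a URL path.

```r
# Hypothetical helper: build the Member Node download URL for a data
# object from its identifier, following the pattern in the EML example.
data_object_url <- function(id,
                            node_base = "https://knb.ecoinformatics.org/knb/d1/mn/v1",
                            encode = FALSE) {
  # Optionally percent-encode the identifier (assumption: useful when the
  # id contains spaces, slashes, or other reserved characters)
  if (encode) id <- utils::URLencode(id, reserved = TRUE)
  paste0(node_base, "/object/", id)
}

# Reproduces the <url> value used for the dataTable above
print(data_object_url("urn:uuid:706a990d-71f0-4b37-a9d9-72b6051771bd"))
```

Using the same identifier string for the entity's id attribute and for the URL keeps the EML and the uploaded object in sync.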

mbjones commented 10 years ago

@cboettig The formats listed were just a starter set, and we have been adding new formats to the list whenever they are needed. We have an API for that, but it is restricted to a small group of people so that we have consensus on what to use to identify different formats. If you have a list of suggested formats, you can request that they be added by emailing the suggestion to 'developers@dataone.org', or by submitting a ticket to the DataONE Ticket System. Proposals get reviewed for redundancy, and to be sure the proposed ID is the best match for the format, which is a bit subjective because some standards are not clear on this. And then they are added.

cboettig commented 10 years ago

@mbjones so awesome!

mbjones commented 10 years ago

@cboettig FYI, I created a feature request, Ticket #4127, for adding support for MNStorage.generateIdentifier() to the R dataone package. It shouldn't be hard to do, so let me know if this is holding you back. The rest of the issues for the next release of the dataone R package are listed on our backlogs page.