@cboettig Your incantation should work, but rather than using the generic text/xml type, it would be better to specifically type it with the version of EML that you are using, such as:
eml_file <- new(Class="D1Object", id2, "myeml.xml", "eml://ecoinformatics.org/eml-2.1.1", mn_nodeid)
You can get a list of the controlled format IDs from the DataONE formats service: https://cn.dataone.org/cn/v1/formats
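If it helps, here is a rough sketch (not part of the dataone client, and assuming the XML package plus an R build whose readLines can fetch https URLs) for pulling that list of format IDs into R:
# Sketch: parse the formats service response and extract the registered format IDs
library(XML)
formats_xml <- paste(readLines("https://cn.dataone.org/cn/v1/formats"), collapse = "")
doc <- xmlParse(formats_xml, asText = TRUE)
format_ids <- xpathSApply(doc, "//*[local-name()='formatId']", xmlValue)
head(format_ids)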
To associate the CSV and EML files, you create a DataPackage, add both the EML file and the CSV files to the package, and indicate that the EML file documents the CSV files by setting a relationship property. As there is only one such property, the R client provides a mechanism for this via the insertRelationship() method. For example:
# Create a data object from the CSV content and make it publicly readable
d1Object <- new(Class="D1Object", id, csvdata, format, mn_nodeid)
setPublicAccess(d1Object)
# Create a metadata object and make it public as well
metadata <- paste(readLines("test.xml"), collapse = '')
format.mta <- "eml://ecoinformatics.org/eml-2.1.1"
d1o.md1 <- new("D1Object", id.mta, metadata, format.mta, mn_nodeid)
setPublicAccess(d1o.md1)
# Assemble our data package containing both metadata and data
data.package <- new("DataPackage", packageId=id.pkg)
addData(data.package, d1Object)
addData(data.package, d1o.md1)
# Record that the metadata object documents the data object
insertRelationship(data.package, id.mta, c(id))
# Now upload the whole package to the member node
create(d1.client, data.package)
The DataPackage is represented by an OAI-ORE ResourceMap, which is an RDF document listing the contents to be aggregated together and some properties about those contents. Here's an example ResourceMap that gets created, for reference purposes:
https://knb.ecoinformatics.org/knb/d1/mn/v1/object/resourceMap_6000141086_2.3.4
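The resource map is just a plain RDF/XML document, so you can fetch and skim it directly; a quick sketch (any HTTP client works, here plain readLines on the URL above):
# Sketch: retrieve the ResourceMap (RDF/XML) for a package and peek at the start of it
rm_url <- "https://knb.ecoinformatics.org/knb/d1/mn/v1/object/resourceMap_6000141086_2.3.4"
resource_map <- paste(readLines(rm_url), collapse = "\n")
cat(substr(resource_map, 1, 500))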
When the EML is uploaded, most of it is indexed for search and discovery. What else did you mean by 'extracted'?
@mbjones Perfect, thanks!
Right, I meant indexed. Very cool that the indexing of the EML happens automatically -- I assume the format.mta information triggers this?
Can you clarify which fields are indexed and how? That is, which fields just become part of a full-text search versus which can be searched as a specific field (such as the "spatial/temporal/taxonomic context" nodes)?
One more question -- can you clarify the use of the ids in the upload? For instance, I see you generate a random id for the simple CSV file example. It appears that the KNB client returns a different identifier when it registers the data? I also understand that I can get a DOI from the KNB/DataONE, but I'm not sure how to do that. Also, as far as the EML data-package example goes, it seems these identifiers should correspond to ids in the EML, but I'm not quite sure how.
@mbjones By the way, complete side note here: that controlled formats list you linked to is kinda inspiring! I do notice a few formats that might be nice to have -- kind of wondering what the process for adding formats is? (In particular, I see application/x-python and application/mathematica but nothing obviously for R scripts or R packages. Likewise, I see the nexus/1997 format for phylogenies, but not the richer nexml format. Or is it undesirable to proliferate these classes too much?)
Yes, if the format is set to one of the DataONE-recognized METADATA types, then it will be indexed. The formats service I listed previously indicates which formats are METADATA and which are DATA from DataONE's perspective.
You can get a list of the fields that DataONE indexes from the query service, using a URL like: https://cn.dataone.org/cn/v1/query/solr When indexing content, DataONE maps fields from different metadata standards (e.g., EML, FGDC, ISO 19139, etc.) onto that common solr schema, so there is a bit of interpretation in the mapping. This process is described in more detail here: http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html
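As a concrete illustration (a sketch, not from this thread -- q, fl, and rows are standard solr parameters passed through by the DataONE query service, and the field names are taken from the schema listed above):
# Sketch: run a search against the DataONE solr index from R and print the raw response
q <- "https://cn.dataone.org/cn/v1/query/solr/?q=title:mohawk&fl=identifier,title,formatId&rows=5"
cat(readLines(q), sep = "\n")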
The client should return the same identifier you passed in on create() -- it shouldn't be different. The ids are set by the client, with the only significant rule being that they need to be globally unique (minus some details about character escaping, length, and non-printing characters). To help with that, Member Nodes are able to provide some identifier generation services which will generate and/or reserve an identifier for a particular client. The KNB node implements a service to generate UUIDs and one to generate DOIs. Unfortunately, I haven't exposed this functionality in the R Client yet -- needs to be added. Here's how you would get an ID in curl, assuming you'd already logged into CILogon to generate your client certificate:
curl -E /tmp/x509up_u501 -F scheme=UUID https://knb.ecoinformatics.org/knb/d1/mn/v1/generate
If you switch scheme=UUID to scheme=DOI, you'll generate a DOI, but please don't use this for testing -- we don't want to issue a bunch of bogus DOIs. We have a separate test service if you want to set up for testing. If the DOI is used in a create() call on an EML document, then Metacat will automatically register the metadata with that DOI on EZID and make sure the DOI resolves to the right location.
Once you've decided on your IDs, these should go in the EML document as well. The overall identifier for the EML document goes in the packageId attribute on the root eml element. The individual IDs for the data files should go in the entity section for that data object, typically under the id attribute for the entity, and/or under the online/url element, in which case you should use a URI format for the identifier. For example:
<eml:eml
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:stmml="http://www.xml-cml.org/schema/stmml"
    xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://sbc.lternet.edu/external/EML/schemas/eml-2.1.1/eml.xsd"
    system="knb"
    scope="system"
    packageId="urn:uuid:d4c22d88-bf30-47dc-a866-fb6cf47ec636">
  <access authSystem="knb" order="denyFirst" scope="document">
    <allow>
      <principal>public</principal>
      <permission>read</permission>
    </allow>
  </access>
  <dataset scope="document">
    <shortName>Test Dataset</shortName>
    <title>TESTDATA: Test data set for uploading content, should be ignored</title>
    ...
    <dataTable id="urn:uuid:706a990d-71f0-4b37-a9d9-72b6051771bd" scope="system">
      <entityName>mohawk_mooring_mko.txt</entityName>
      <entityDescription>Moored data from Mohawk Outside Spar (station = mko) CTD, ADCP VSF, 3
        thermistors (2005-01-01 - 2008-01-15)</entityDescription>
      <physical scope="document">
        <objectName>mohawk_mooring_mko.txt</objectName>
        <size unit="bytes">67632386</size>
        ...
        <distribution>
          <online>
            <url function="download">https://knb.ecoinformatics.org/knb/d1/mn/v1/object/urn:uuid:706a990d-71f0-4b37-a9d9-72b6051771bd</url>
          </online>
        </distribution>
      </physical>
      ...
    </dataTable>
  </dataset>
  ...
</eml:eml>
@cboettig The formats listed were just a starter set, and we have been adding new formats to the list whenever they are needed. We have an API for that, but it is restricted to a small group of people so that we have consensus on what to use to identify different formats. If you have a list of suggested formats, you can request that they be added by emailing the suggestion to 'developers@dataone.org', or by submitting a ticket to the DataONE Ticket System. Proposals get reviewed for redundancy, and to be sure the proposed ID is the best match for the format, which is a bit subjective because some standards are not clear on this. And then they are added.
@mbjones so awesome!
@cboettig FYI, I created a feature request (Ticket #4127) for adding support for MNStorage.generateIdentifier() to the R dataone package. It shouldn't be hard to do, so let me know if this is holding you back. The rest of the issues for the next release of the dataone R package are listed on our backlogs page.
@mbjones Revisiting this -- now that I have authentication working I thought I'd go ahead and add KNB as a publish option for reml (see https://github.com/ropensci/reml/issues/20, https://github.com/ropensci/reml/issues/23).
Looks like the example script just uploads the CSV file directly: https://github.com/mbjones/opensci_r_esa_2013/blob/master/dataone-r/dataone-write-data.R Wondering what I need to do to upload an EML file? Is it just a matter of changing the format string to the EML format ID, or do I need a different Class, etc?
Probably best if you could provide an example uploading an EML file that I can then wrap into a publish_knb function? (As we discussed in https://github.com/ropensci/reml/issues/23, this would just be a helper routine for the publish_eml function.)
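For what it's worth, here is roughly what I have in mind, adapted from the metadata half of your earlier example in this thread (untested; eml_id is just a placeholder for whatever identifier I generate, and mn_nodeid is as in your example):
# Rough sketch (untested), adapted from the earlier example:
# build a D1Object holding the EML text, typed with the EML format ID
eml_text <- paste(readLines("myeml.xml"), collapse = '')
eml_format <- "eml://ecoinformatics.org/eml-2.1.1"
eml_obj <- new("D1Object", eml_id, eml_text, eml_format, mn_nodeid)
setPublicAccess(eml_obj)
# ...then upload it the same way the CSV object is uploaded in dataone-write-data.R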