behrica closed this issue 7 years ago
Thanks for this @behrica. On initial look I see what you mean: the raw XML has a lot of structure to it. I think the payload field in the output of get_records("oai:zenodo.org:159890", url="https://zenodo.org/oai2d", prefix="oai_datacite") has the entire content under the payload tag in the XML, whereas ideally that content should be parsed while retaining its structure.
Will look into this, but I'm not sure yet how many different record formats there will be. If there are lots, it may not be feasible to provide custom parsers for each one, though we could for the common ones.
I looked a bit at the documentation and at what zenodo.org does.
It is true that via the 'metadataPrefix' mechanism an OAI-PMH compliant repository can return records in any format, and it is indeed questionable which formats this package should support, and how.
Support for 'oai_dc' is mandatory, so this library should indeed first try to fully support that format.
But as the current API allows passing an arbitrary metadataPrefix, it gives the impression that it will work with any metadata format.
So maybe it should be documented that the current focus is on the 'oai_dc' format only.
For now I can already use the API with the as='raw' parameter and parse the XML myself.
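For anyone else taking the as='raw' route in the meantime, here is a minimal sketch of parsing subjects out of the raw XML. It assumes the xml2 package is installed. The XML string is a hand-written, hypothetical DataCite-style fragment standing in for whatever as='raw' returns; it is not actual Zenodo output, and the subject values and scheme name are made up.

```r
library(xml2)

# Hypothetical stand-in for the string returned with as = "raw";
# element names follow the DataCite schema, values are invented.
raw_xml <- '<record>
  <metadata>
    <resource xmlns="http://datacite.org/schema/kernel-3">
      <subjects>
        <subject>ecology</subject>
        <subject subjectScheme="example-scheme">marine biology</subject>
      </subjects>
    </resource>
  </metadata>
</record>'

doc <- read_xml(raw_xml)

# Match on local-name() so the query does not depend on which
# namespace (or schema version) the payload declares.
subject_nodes <- xml_find_all(doc, ".//*[local-name() = 'subject']")

# One row per subject; scheme is NA when the attribute is absent.
subjects <- data.frame(
  subject = xml_text(subject_nodes),
  scheme  = xml_attr(subject_nodes, "subjectScheme"),
  stringsAsFactors = FALSE
)
print(subjects)
```

Matching on local-name() is a convenience choice here; registering the namespace explicitly would be stricter but ties the code to one schema version.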
Just one more comment.
I believe that zenodo.org is, or will be, an important data repository, and its documentation says that its preferred format is datacite. See here: https://zenodo.org/dev#harvest-metadata Therefore datacite should be supported in the mid to long term.
working on the oai_dc parser now, will ping when it's up
@behrica try again after reinstalling devtools::install_github("ropensci/oai")
Works now.
Thanks.
glad it works
We are using zenodo.org as an OAI-compatible repository. Zenodo encodes certain "subjects" in this format in the XML:
Subjects of this type do not appear in the data.frame if I call list_records like this:
I need to choose the "oai_datacite" prefix; otherwise the server does not return the records at all.
So by using prefix="oai_datacite" I can see that they get returned when using the "raw" option, like this:
But the later parsing omits them somehow. I looked at the code and debugged it, but could not find where exactly they get lost.