ropensci / EDIutils

An API Client for the Environmental Data Initiative Repository
https://docs.ropensci.org/EDIutils/

Using search_data_packages only returns YYYY for pubdate instead of YYYY-MM-DD #45

Closed: gkamener closed this issue 12 months ago

gkamener commented 1 year ago

I'm experiencing a possible bug when attempting to query FCE package metadata when including the pubdate.

Using the search_data_packages() function to query pubdate only returns the year for that value instead of a date (expecting something including YYYY-MM-DD). In comparison, including begindate or enddate in the same query returns YYYY-MM-DD for those values.

I am using version 1.0.2 of the package with R version 4.2.2.

The script I'm running and a screenshot of the result are provided below.

library(EDIutils)

query <- search_data_packages(query = 'q=scope:(knb-lter-fce)&fl=doi,title,packageid,begindate,enddate,pubdate')

Screenshot 2023-06-07 163855

clnsmth commented 1 year ago

Thanks for reporting this @gkamener. I'll have a look and get back to you here.

clnsmth commented 1 year ago

Thanks for your patience @gkamener.

EDI is considering a change to the pubdate field of the Search Data Packages API method, but before doing so, and possibly breaking anyone's code that relies on the current result, we'd like to hear more about your particular use case.

Please note, one immediate fix to this issue is to call the Read Metadata Resource Metadata method with the full data package ID (e.g. knb-lter-fce.1076.4) to get the dateCreated field, which contains the value of pubdate but in the YYYY-MM-DD format you are looking for.
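A minimal sketch of this workaround in R follows. It assumes that read_metadata_resource_metadata() accepts as = "xml" and that dateCreated may come back as a timestamp rather than a bare date; both are assumptions to check against the package documentation. The helper to_date() is hypothetical and only trims such a value down to YYYY-MM-DD.

```r
# Hypothetical helper: reduce a value such as "2019-03-05 14:12:33.4"
# (assumed timestamp layout) or "2019-03-05" to a Date.
to_date <- function(x) {
  as.Date(substr(trimws(x), 1, 10))
}

# Sketch of the call itself (needs network access, so it is left commented out):
# res <- EDIutils::read_metadata_resource_metadata("knb-lter-fce.1076.4", as = "xml")
# created <- xml2::xml_text(xml2::xml_find_first(res, ".//dateCreated"))
# to_date(created)
```

Only the first ten characters are kept, so the helper behaves the same whether or not a time component is present.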

gkamener commented 1 year ago

Thank you for reviewing this @clnsmth.

My use case is to utilize metadata from each FCE package in EDI's repository as a validation check against portions of metadata we have for those packages in the FCE database. We use the latter to track the current status and other details for each package, and the metadata returned from search_data_packages has already helped me correct some erroneous enddate values plus other metadata in our database.

Being able to retrieve the most recent pubdate values in the YYYY-MM-DD format for all FCE packages through search_data_packages would be helpful, but I don't think making such changes just for my use case would be worth breaking anyone's code.

Thank you for suggesting read_metadata_resource_metadata. I may look into that as a check to ensure that pubdates from EDI align with what we have in the FCE database.

clnsmth commented 1 year ago

Thanks for this helpful context @gkamener. We'll take this into consideration.

Another API method that may help with your metadata validation use case is Read Metadata. It returns the full EML metadata record, in XML, which is the source of the information that is indexed and returned through the Search Data Packages method. So, if you can find a piece of information via Search Data Packages, you will also find it in the source metadata. Note that the indexed metadata is a considerably smaller subset of the source metadata record.

Now, you may be scratching your head and asking, "Why would I want to access the publication date through the Read Metadata method just to get the same value I get through Search Data Packages?" Well, a transformation occurs in the Search Data Packages pathway, and you can bypass it by reading the EML metadata and parsing the XML to get the <pubDate> element value directly. For example:

> library(EDIutils)
> library(xml2)
> 
> # Read the metadata of a data package and get the publication date
> eml <- read_metadata("knb-lter-fce.1076.4")
> pubdate <- xml_find_all(eml, xpath = ".//dataset/pubDate")
> xml_text(pubdate)
[1] "2019-03-05"
> 
> # While we're at it, get the begin and end dates as well
> begindate <- xml_find_all(eml, xpath = ".//dataset/coverage//beginDate/calendarDate")
> xml_text(begindate)
[1] "1998-08-19"
> 
> enddate <- xml_find_all(eml, xpath = ".//dataset/coverage//endDate/calendarDate")
> xml_text(enddate)
[1] "2006-12-03"
> 
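For the validation use case described earlier in this thread, the same lookup could be wrapped in a function and its results compared against a local table. This is only a sketch: get_pubdate() and compare_pubdates() are hypothetical helpers, the column names packageid and pubdate are assumptions, and the rows in the demonstration are placeholders (only the knb-lter-fce.1076.4 date comes from the output above).

```r
# Hypothetical helper: fetch the <pubDate> value for one package ID.
# Needs network access plus the EDIutils and xml2 packages when called.
get_pubdate <- function(package_id) {
  eml <- EDIutils::read_metadata(package_id)
  xml2::xml_text(xml2::xml_find_first(eml, ".//dataset/pubDate"))
}

# Hypothetical helper: join EDI and local pubdates on packageid and
# flag whether the two values agree.
compare_pubdates <- function(edi, local) {
  merged <- merge(edi, local, by = "packageid",
                  suffixes = c("_edi", "_local"))
  merged$match <- merged$pubdate_edi == merged$pubdate_local
  merged
}

# Offline demonstration with placeholder rows (no API call is made here).
edi <- data.frame(
  packageid = c("knb-lter-fce.1076.4", "example.1.1"),
  pubdate = c("2019-03-05", "2020-01-01"),
  stringsAsFactors = FALSE
)
local <- data.frame(
  packageid = c("knb-lter-fce.1076.4", "example.1.1"),
  pubdate = c("2019-03-05", "2020-02-01"),
  stringsAsFactors = FALSE
)
result <- compare_pubdates(edi, local)
print(result[!result$match, ])
```

In practice the edi table would be built by applying get_pubdate() to each package ID of interest, which issues one request per package.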
gkamener commented 1 year ago

Thanks for the suggestion @clnsmth! It's very helpful!

clnsmth commented 12 months ago

Hi @gkamener. Is there anything else I can lend a hand with before closing this issue?

gkamener commented 12 months ago

Hi @clnsmth. I think I'm good. Thanks for the help!