ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_summary doesn't like nested XML in JSON or XML(sampledata) #82

Closed tomck closed 7 years ago

tomck commented 8 years ago
cp460 <- entrez_summary(db="biosample", id="2886856", rettype="JSON")
> cp460$sampledata
[1] "&lt;BioSample submission_date=\"2014-06-26T08:08:36.203\" last_update=\"2015-08-10T23:52:20.737\" publication_date=\"2014-06-26T08:08:36.203\" access=\"public\" id=\"2886856\"
[snip]

Above is a test case; below is how it looks when queried directly: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=biosample&id=2886856&version=2.0&retmode=json

(very confusing that for eutils it's retmode but rentrez it's rettype, but I digress)

Expected results: the ability to access things like cp460$sampledata$latitude and cp460$sampledata$longitude

Actual results: XML as escaped text

dwinter commented 8 years ago

Thanks for the report, @tomck.

retmode and rettype are both valid eutils arguments. I don't think the rentrez docs discuss the difference, which might be a useful thing to add.
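For reference, in the E-utilities themselves retmode selects the output format (for example xml, json or text), while rettype selects which kind of record or view is returned (for example abstract or fasta). The sketch below only builds request URLs to show the two side by side. `eutils_url` is a hypothetical helper, not part of rentrez, and the PubMed ID is a placeholder rather than a record from this thread.

```r
# Hypothetical helper (not part of rentrez): build an E-utilities request URL
# from an endpoint name and a set of named query parameters.
eutils_url <- function(endpoint, ...) {
  params <- list(...)
  query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/", endpoint, ".fcgi?", query)
}

# retmode picks the output format: the summary record from the report, as JSON
summary_url <- eutils_url("esummary", db = "biosample", id = "2886856",
                          version = "2.0", retmode = "json")

# rettype picks the kind of record, retmode its format: an abstract as plain text
# (placeholder PubMed ID, assumed for illustration only)
fetch_url <- eutils_url("efetch", db = "pubmed", id = "12345678",
                        rettype = "abstract", retmode = "text")
```

The first URL is the same request as the direct link in the report, which is why only retmode appears there.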

For this use case, it's very hard to make this an "automatic" part of entrez_summary, because there is no indication of which fields from which databases are going to have these XML-ish entries. And the XML itself is hard to parse cleanly.

On the other hand, we could either document a workaround or provide a function to handle this.

I'll think about that. In the meantime, in this specific case I think the escaped XML is actually what gets returned by efetch, so you could do this to get the results you are interested in:

cp460 <- entrez_fetch(db="biosample", id=2886856, rettype="xml", parsed=TRUE)
lat <- XML::xmlValue(cp460[["//Attribute[@attribute_name='latitude']"]])
lon <- XML::xmlValue(cp460[["//Attribute[@attribute_name='longitude']"]])
c(lat, lon)
[1] "40.79108"  "-73.96178"
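If only the summary record is at hand, another option is to parse the XML string in the sampledata field directly. The following is a minimal sketch under stated assumptions: the string below is a toy stand-in for cp460$sampledata whose layout is guessed from the XPath above, and `parse_sampledata` is a hypothetical helper, not a rentrez function.

```r
library(XML)

# Toy stand-in for cp460$sampledata. The element layout is assumed from the
# XPath used in the efetch example; the real record has many more fields.
sampledata <- paste0(
  '<BioSample access="public" id="2886856">',
  '<Attributes>',
  '<Attribute attribute_name="latitude">40.79108</Attribute>',
  '<Attribute attribute_name="longitude">-73.96178</Attribute>',
  '</Attributes>',
  '</BioSample>'
)

# Hypothetical helper: parse the XML string and return the <Attribute>
# values as a list named by their attribute_name.
parse_sampledata <- function(x) {
  # Undo entity escaping if the string arrives as "&lt;BioSample ..."
  if (startsWith(x, "&lt;")) {
    x <- gsub("&lt;", "<", x, fixed = TRUE)
    x <- gsub("&gt;", ">", x, fixed = TRUE)
    x <- gsub("&quot;", "\"", x, fixed = TRUE)
    x <- gsub("&amp;", "&", x, fixed = TRUE)
  }
  doc <- xmlParse(x, asText = TRUE)
  vals <- xpathSApply(doc, "//Attribute", xmlValue)
  names(vals) <- xpathSApply(doc, "//Attribute", xmlGetAttr, "attribute_name")
  as.list(vals)
}

attrs <- parse_sampledata(sampledata)
c(attrs$latitude, attrs$longitude)
```

This gives the attrs$latitude style of access asked for in the report, but it has to be applied field by field, since nothing in the summary says which fields hold XML.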

dwinter commented 7 years ago

After a mere ... year, I have at last provided a little documentation page on how to deal with these records: https://github.com/ropensci/rentrez/wiki/Parse-html--information-within-esummary-records

Closing now