hammerlab / t-cell-data

https://tcelldata.hammerlab.org
Pull publication metadata from Entrez E-utilities #17

Closed: hammer closed this issue 5 years ago

hammer commented 5 years ago

Docs: E-utilities Quick Start

R client: https://github.com/gschofl/reutils (uses R5/Reference classes object system, yikes)

Example efetch query for a single PMID: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=22368089. Returns some weird thing that looks like typed JSON but isn't, so I don't know how to parse it; going to have to work w/ XML, sadly.
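For reference, a minimal Python sketch of the XML route, assuming efetch returns PubMed XML when `retmode=xml` is added to the query. The helper names (`build_efetch_url`, `fetch_pubmed_xml`, `parse_pubmed_xml`) and the `SAMPLE_XML` record are made up for illustration, and the element paths reflect my reading of the PubMed XML layout, so check them against a real response before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Hand-written stand-in for an efetch response (NOT real PubMed data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2012</Year></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>An <i>example</i> title.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def build_efetch_url(pmid):
    """Build an efetch URL that explicitly asks for XML output."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def fetch_pubmed_xml(pmid):
    """Download the raw XML for one PMID (requires network access)."""
    with urllib.request.urlopen(build_efetch_url(pmid)) as response:
        return response.read().decode("utf-8")


def _text(node, path):
    """Return all text under `path`, including inline markup, or None if absent."""
    found = node.find(path)
    return None if found is None else "".join(found.itertext()).strip()


def parse_pubmed_xml(xml_text):
    """Extract a few metadata fields from each PubmedArticle element."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        records.append({
            "pmid": _text(citation, "PMID"),
            "title": _text(citation, "Article/ArticleTitle"),
            "journal": _text(citation, "Article/Journal/Title"),
            "year": _text(citation, "Article/Journal/JournalIssue/PubDate/Year"),
        })
    return records
```

Against the live service this would be `parse_pubmed_xml(fetch_pubmed_xml("22368089"))`. Fields missing from a record come back as `None` rather than raising.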

hammer commented 5 years ago

Oh there's also https://github.com/ropensci/rentrez. They don't know how to parse efetch's weird JSON-ish data either.

hammer commented 5 years ago

For extra fun we could try to pull down full text, cf. https://www.ncbi.nlm.nih.gov/pmc/tools/get-full-text/ and https://www.ncbi.nlm.nih.gov/pmc/tools/ftp.

armish commented 5 years ago

> Returns some weird thing that looks like typed JSON but isn't, so I don't know how to parse it; going to have to work w/ XML, sadly.

I used to parse the clinical trials XML archive from NCI, and all I can say is that even if you can parse the XML, there will be lots of weird edge cases in its syntax. I ended up using one of the standard XML-to-JSON solutions and always worked with the converted JSON to stay sane. There was a nice xml2json Node library that could do this about five years ago. I assume there are even better ones now, and IMHO that would be the way to go.
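A rough sketch of that convert-first approach in Python, using only the standard library rather than the xml2json library mentioned above. The `@attributes` and `#text` keys and the function names are arbitrary conventions picked for this example, and mixed content (text interleaved with child elements) is dropped, so treat it as a starting point only.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an Element into plain dicts, lists, and strings."""
    node = {}
    if element.attrib:
        node["@attributes"] = dict(element.attrib)

    children = list(element)
    if not children:
        text = (element.text or "").strip()
        if not node:
            return text  # leaf without attributes: just its text
        if text:
            node["#text"] = text
        return node

    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text):
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)
```

One caveat with this kind of conversion: a tag that appears once becomes a scalar and a tag that appears several times becomes a list, so downstream code has to handle both shapes.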