There are currently 60220 litcovid entries, of which 371 are missing the name field and 1488 are missing the url field.
Sample of entries missing required properties:
_id           name  url
pmid32514951  NaN   https://www.doi.org/10.1007/s15006-020-0579-4
pmid32401447  NaN   NaN
pmid32833365  NaN   NaN
pmid32266709  NaN   NaN
pmid32401449  NaN   NaN
pmid32559400  NaN   https://www.doi.org/10.1089/cyber.2020.29190.cfp
pmid32556025  NaN   https://www.doi.org/10.36416/1806-3756/e20200190
pmid32803989  NaN   https://www.doi.org/10.1089/ham.2020.0133
pmid32433296  NaN   NaN
pmid32433298  NaN   NaN
Cause of missing names:
Normally the title is stored as an ArticleTitle element in the XML returned by the API query, but for entries with foreign-language titles it may be stored as a VernacularTitle instead.
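One possible way to handle this is to fall back to VernacularTitle whenever ArticleTitle is missing or empty. The sketch below is only an illustration: the sample record is a hypothetical, heavily simplified stand-in for the real PubMed XML, and extract_name is an illustrative helper, not part of the existing parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record; real PubMed XML contains many more elements.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle/>
        <VernacularTitle>Titre en langue originale.</VernacularTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_name(article):
    """Return ArticleTitle text, or VernacularTitle text if the former is missing or empty.

    Only direct text is read here; formatting tags nested inside the
    title are not handled by this helper.
    """
    for tag in ("ArticleTitle", "VernacularTitle"):
        node = article.find(f".//{tag}")
        if node is not None:
            text = (node.text or "").strip()
            if text:
                return text
    return None


root = ET.fromstring(SAMPLE_XML.strip())
names = [extract_name(article) for article in root.iter("PubmedArticle")]
print(names)
```

Trying ArticleTitle first keeps the current behaviour for ordinary entries and only changes the result for entries that would otherwise have no name.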
Cause of missing urls:
The parser generates the urls from DOIs. If an entry doesn't have a DOI, it won't have a url. This can be solved by falling back to the url generated for the curatedBy property (curatedBy.url).
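A minimal sketch of that fallback, under stated assumptions: each parsed entry is a dict, the DOI lives under a "doi" key (an assumed field name), and the curatedBy urls shown are placeholders rather than real values. The https://www.doi.org/ prefix follows the sample urls listed above.

```python
def resolve_url(doc):
    """Prefer a DOI-based url; otherwise fall back to curatedBy.url (may be None)."""
    doi = doc.get("doi")  # "doi" is an assumed field name
    if doi:
        return f"https://www.doi.org/{doi}"
    # curatedBy may be absent or None, so guard before reading its url.
    return (doc.get("curatedBy") or {}).get("url")


# Hypothetical entries: one with a DOI, one without. The curatedBy urls are placeholders.
docs = [
    {
        "_id": "pmid32514951",
        "doi": "10.1007/s15006-020-0579-4",
        "curatedBy": {"url": "https://example.org/curated/32514951"},
    },
    {
        "_id": "pmid32401447",
        "curatedBy": {"url": "https://example.org/curated/32401447"},
    },
]

for doc in docs:
    doc["url"] = resolve_url(doc)
    print(doc["_id"], doc["url"])
```

With this ordering, entries that already have a DOI keep their DOI-based url, and only entries without one take the curatedBy url.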
The current API query returns an XML file. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=32833365
Note that it is also possible to retrieve JSON-like docs via the Entrez API. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=32833365
In this case, the foreign-language title is still stored as title but is prefixed by trans. The problem with this format is that it seems to be their own brand of JSON and lacks any colons (:), making it pretty much unusable without the biopython library.
Additional parsing issues causing missing name (which may also be the source of the malformed json files):
- Since the json file is parsed from an xml file, html tags used for formatting content may interfere with the conversion from xml to json. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=32269766
In this example, the title is stored properly under ArticleTitle; however, the term In silico is italicized using html tags (<i></i>). This means that when converting from xml to json, the parser needs to go a level down the tree to recover the full title.
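To illustrate the nested-tag problem, the sketch below uses Python's standard xml.etree.ElementTree on a hypothetical title shaped like the In silico case. It shows why reading only the element's own text loses the title, and how collecting text from the whole subtree could recover it. This is one possible approach, not necessarily how the current parser converts xml to json.

```python
import xml.etree.ElementTree as ET

# Hypothetical title with an italicized term, shaped like the case described above.
SAMPLE = "<ArticleTitle><i>In silico</i> screening of candidate compounds.</ArticleTitle>"

node = ET.fromstring(SAMPLE)

# .text only holds text that appears before the first child element,
# so it is None when the title starts with a formatting tag.
print(repr(node.text))

# itertext() walks the element and its descendants in document order,
# yielding the text inside <i> and the text that follows it.
full_title = "".join(node.itertext())
print(full_title)
```

Joining the pieces from itertext() flattens the formatting tags away, which is usually what a plain-text name field needs.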