outbreak-info / litcovid

parser for LitCOVID Publications

missing required fields like 'name' or 'url' #18

Closed gtsueng closed 3 years ago

gtsueng commented 3 years ago

There are currently 60220 litcovid entries, of which 371 are missing the name field and 1488 are missing the url field.

Sample of entries missing required properties:

_id           name  url
pmid32514951  NaN   https://www.doi.org/10.1007/s15006-020-0579-4
pmid32401447  NaN   NaN
pmid32833365  NaN   NaN
pmid32266709  NaN   NaN
pmid32401449  NaN   NaN
pmid32559400  NaN   https://www.doi.org/10.1089/cyber.2020.29190.cfp
pmid32556025  NaN   https://www.doi.org/10.36416/1806-3756/e20200190
pmid32803989  NaN   https://www.doi.org/10.1089/ham.2020.0133
pmid32433296  NaN   NaN
pmid32433298  NaN   NaN
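
For reference, counts like the ones above could be reproduced with something along these lines. This is only a rough sketch, not the script that produced the numbers: `count_missing` is a made-up helper, the third record is a placeholder, and it assumes missing values arrive as `None`, an empty string, or an absent key (a pandas `NaN` is truthy and would need `pandas.isna` instead).

```python
from collections import Counter

REQUIRED = ("name", "url")


def count_missing(docs, required=REQUIRED):
    """Count, per required field, how many docs lack a usable value."""
    missing = Counter()
    for doc in docs:
        for field in required:
            # Treats None, "" and absent keys as missing (assumption).
            if not doc.get(field):
                missing[field] += 1
    return missing


docs = [
    {"_id": "pmid32514951", "name": None,
     "url": "https://www.doi.org/10.1007/s15006-020-0579-4"},
    {"_id": "pmid32401447", "name": None, "url": None},
    # Placeholder record, not a real entry.
    {"_id": "placeholder", "name": "A made-up title", "url": None},
]

counts = count_missing(docs)
print(dict(counts))
```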

The current API query returns an XML file. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=32833365

Cause of missing names: Normally the title is stored as an ArticleTitle in this XML, but for entries with foreign-language titles it may be stored as a VernacularTitle instead.
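
A possible fix for the missing names is to fall back to VernacularTitle when ArticleTitle is absent or empty. A minimal sketch using the standard library's ElementTree; `get_name` is a hypothetical helper, not the parser's current code, and the sample record is a trimmed-down, made-up stand-in for real efetch output (it assumes ArticleTitle is empty for these entries):

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed record for illustration only.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <ArticleTitle/>
      <VernacularTitle>Titre en langue originale</VernacularTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def get_name(article):
    """Return the ArticleTitle text, falling back to VernacularTitle."""
    for tag in ("ArticleTitle", "VernacularTitle"):
        node = article.find(f".//{tag}")
        if node is not None:
            # itertext() also collects text nested inside child tags.
            text = "".join(node.itertext()).strip()
            if text:
                return text
    return None


name = get_name(ET.fromstring(SAMPLE))
print(name)
```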

Cause of missing urls: The parser generates urls from DOIs, so an entry without a DOI gets no url. This can be solved by falling back to the url generated for the curatedBy property (curatedBy.url).
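
The url fallback could look roughly like this. It is a sketch under assumptions: `build_url` and the `doi` key are hypothetical names, the doc is assumed to be a plain dict, and the curatedBy url in the example is a placeholder rather than the real pattern.

```python
def build_url(doc):
    """Prefer the DOI-based url; otherwise fall back to curatedBy.url."""
    doi = doc.get("doi")  # hypothetical key name
    if doi:
        return f"https://www.doi.org/{doi}"
    # curatedBy may be absent or None, so guard before reading its url.
    return (doc.get("curatedBy") or {}).get("url")


with_doi = build_url({"doi": "10.1089/ham.2020.0133"})
without_doi = build_url({"curatedBy": {"url": "https://example.org/placeholder"}})
print(with_doi)
print(without_doi)
```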

Note that it is possible to retrieve json-ish docs via the entrez api as well. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=32833365

In this case, the foreign-language title is still stored as title but is prefixed by trans. The problem with this format is that it seems to be their own brand of json and lacks any sort of colons (:), making it pretty much unusable without the biopython library.

Additional parsing issues causing missing names (which may be the source of the malformed json files too):

- Since the json file is parsed from an xml file, html tags used to format content may interfere with the xml-to-json conversion. Example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=32269766

In this example, the title is stored properly under ArticleTitle; however, the term In silico is italicized using html tags (<i></i>). This means that when converting from xml to json, the parser needs to go one level further down the tree to get the full title.
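
To illustrate the nested-tag problem: with ElementTree, reading `.text` only returns the text before the first child tag, while `itertext()` walks the whole subtree. This is a sketch with a made-up title (not the actual title of 32269766), and it uses ElementTree rather than whatever xml-to-json conversion the parser currently relies on.

```python
import xml.etree.ElementTree as ET

# Made-up title that starts with an italicized term.
fragment = (
    "<ArticleTitle><i>In silico</i> study of a hypothetical target."
    "</ArticleTitle>"
)
node = ET.fromstring(fragment)

# .text stops at the first child element, so nothing is found here.
shallow_title = node.text

# itertext() yields text from the element and all of its descendants.
full_title = "".join(node.itertext())

print(repr(shallow_title))
print(full_title)
```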