Open nleguillarme opened 5 years ago
I too ran into this. The article titled "Premorbid IQ varies across different definitions of schizophrenia" returns .pubmed_id '17342225\n10435610\n1638332\n15474902\n14302768\n9403903\n16297601\n5009428\n6382590\n12597613\n3292568\n16221995\n10986554\n16946869\n1182406\n12414070\n16330717\n15066893\n16484093\n1931805\n10678506\n9223148\n16639153\n4752222\n10442433\n12379446'
This is due to how getContent parses the XML. Looking at @M0rtenB's example XML, the authors of "Premorbid IQ ..." seem to have included the PubMed IDs of all their citations.
`
<Reference>
<Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">1638332</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Schizophr Res. 2004 Dec 1;71(2-3):323-30</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">15474902</ArticleId>
</ArticleIdList>`
Most articles will only have a small ArticleId snippet (not an ArticleId for every citation), which looks like this:
`
<ArticleId IdType="pmc">PMC1805734</ArticleId>
</ArticleIdList> `
article.py uses getContent() from helpers.py to grab this from the XML. getContent uses element.findall(path) to collect the results, then joins them into a single string separated by newlines (which is what you're seeing).
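To make that concrete, here is a minimal sketch of the behaviour described above. It is not the actual pymed code: `get_content_sketch` is a hypothetical stand-in for getContent, and the XML is a hand-trimmed sample built from the snippet earlier in this comment. I'm assuming 17342225 is the article's own ID, since it comes first in the reported string.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample, NOT a full PubMed record. The two <Reference> entries
# are copied from the snippet above; the surrounding layout is an assumption.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">17342225</PMID>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">1638332</ArticleId>
        </ArticleIdList>
      </Reference>
      <Reference>
        <Citation>Schizophr Res. 2004 Dec 1;71(2-3):323-30</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">15474902</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""


def get_content_sketch(element, path):
    """Hypothetical stand-in for getContent: find every match, join with newlines."""
    matches = element.findall(path)
    return "\n".join(m.text for m in matches if m.text is not None)


root = ET.fromstring(SAMPLE_XML.strip())

# ".//" searches the whole subtree, so the citations' IDs match as well.
joined = get_content_sketch(root, ".//ArticleId[@IdType='pubmed']")
print(repr(joined))  # '17342225\n1638332\n15474902'
```

The `.//` prefix matches at any depth, which is why the ArticleId elements nested inside each Reference are picked up along with the article's own.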
We could probably change _extractPubMedID to use path = ".//PMID" instead of path = ".//ArticleId[@IdType='pubmed']", and I think that would work. I'm not sure whether there are other gotchas in that solution, though.
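A rough way to compare the two paths, again on a hand-trimmed sample rather than a real record (the layout and the assumption that 17342225 is the article's own ID are mine, not taken from pymed). Taking only the first match with `find` is one defensive option in case some record contains more than one PMID element; that is a guess about possible gotchas, not something I have verified.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample, NOT a full PubMed record; layout is an assumption.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">17342225</PMID>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">1638332</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Current path: also matches the ArticleId nested inside each <Reference>.
current = [e.text for e in root.findall(".//ArticleId[@IdType='pubmed']")]

# Proposed path: in this sample only the article's own <PMID> matches.
proposed = [e.text for e in root.findall(".//PMID")]

# Defensive variant: find() returns only the first match in document order.
first_only = root.find(".//PMID").text

print(current)     # ['17342225', '1638332']
print(proposed)    # ['17342225']
print(first_only)  # 17342225
```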
@nleguillarme your example also uses citation articleIDs
I too ran into this.
@gijswobben @nleguillarme I made a pull request for this issue. It basically follows @mbullmanFHCRC's suggestion.
While iterating on articles resulting from a PubMed query, I noticed that some article ids have parsing issues.
For instance, the query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))
returns (when printing the first 10 results): pubmed_id = '22822430\n18959310\n21310968\n21295371\n20439737' abstract = ('Bald eagles (Haliaeetus leucocephalus) are recovering from severe population declines...