gijswobben / pymed

PyMed is a Python library that provides access to PubMed.
MIT License
191 stars 111 forks

Articles id parsing issue #22

Open nleguillarme opened 5 years ago

nleguillarme commented 5 years ago

While iterating over the articles returned by a PubMed query, I noticed that some article IDs are parsed incorrectly.

For instance, the query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))

returns (when printing the first 10 results): pubmed_id = '22822430\n18959310\n21310968\n21295371\n20439737' abstract = ('Bald eagles (Haliaeetus leucocephalus) are recovering from severe population declines...

M0rtenB commented 5 years ago

I too ran into this. The article titled "Premorbid IQ varies across different definitions of schizophrenia" returns .pubmed_id '17342225\n10435610\n1638332\n15474902\n14302768\n9403903\n16297601\n5009428\n6382590\n12597613\n3292568\n16221995\n10986554\n16946869\n1182406\n12414070\n16330717\n15066893\n16484093\n1931805\n10678506\n9223148\n16639153\n4752222\n10442433\n12379446'

mbullmanFHCRC commented 5 years ago

This is due to how getContent parses the XML. Looking at @M0rtenB's example in XML, the authors of "Premorbid IQ ..." seem to have included the PubMed IDs of all their citations:

```
        <Reference>
            <Citation>Arch Gen Psychiatry. 1999 Aug;56(8):749-54</Citation>
            <ArticleIdList>
                <ArticleId IdType="pubmed">10435610</ArticleId>
            </ArticleIdList>
        </Reference>
        <Reference>
            <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
            <ArticleIdList>
                <ArticleId IdType="pubmed">1638332</ArticleId>
            </ArticleIdList>
        </Reference>
        <Reference>
            <Citation>Schizophr Res. 2004 Dec 1;71(2-3):323-30</Citation>
            <ArticleIdList>
                <ArticleId IdType="pubmed">15474902</ArticleId>
            </ArticleIdList>
```

Most articles will only have a small ArticleId snippet (without an article ID for every citation), which looks like this:

```
    <ArticleIdList>
        <ArticleId IdType="pubmed">17342225</ArticleId>
        <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
```

article.py uses getContent() from helpers.py to grab this from the XML. getContent uses element.findall(path) to collect every matching element, then joins their text into a single string separated by newlines (which is what you're seeing).
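The behaviour described above can be reproduced with the standard library alone. This is a minimal sketch, not pymed's actual source: `get_content_sketch` is a hypothetical stand-in for getContent, written only from the description in this thread, and the XML is a hand-written sample modelled on the snippets above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: the article's own ID list plus one cited reference.
SAMPLE = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
      <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">1638332</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
""".strip()


def get_content_sketch(element, path):
    """Hypothetical stand-in for getContent: findall, then join with newlines."""
    results = element.findall(path)
    return "\n".join(r.text for r in results if r.text is not None)


root = ET.fromstring(SAMPLE)
# ".//" searches the whole subtree, so the cited reference's ID matches too.
pubmed_id = get_content_sketch(root, ".//ArticleId[@IdType='pubmed']")
print(repr(pubmed_id))
```

Because the path starts with `.//`, the search is not limited to the article's own ID list, so the ID of every cited reference is appended after the real one.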

We could probably change _extractPubMedID to use path = ".//PMID" instead of path = ".//ArticleId[@IdType='pubmed']", and I think that would work. Not sure if there are other gotchas in that solution though.
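A rough sketch of that suggestion, using only the standard library. `extract_pubmed_id` is a hypothetical helper, not the library's `_extractPubMedID`, and the sample XML is hand-written; placing the `PMID` element under `MedlineCitation` is an assumption about the record layout. One possible gotcha: a record may contain further `PMID` elements elsewhere, so this sketch takes only the first match with `find` rather than joining all matches.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element layout is an assumption for illustration.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">17342225</PMID>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
      <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">1638332</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
""".strip()


def extract_pubmed_id(article):
    """Hypothetical helper: return the text of the first PMID element, or None."""
    node = article.find(".//PMID")  # find() stops at the first match in document order
    if node is None or node.text is None:
        return None
    return node.text.strip()


root = ET.fromstring(SAMPLE)
print(extract_pubmed_id(root))
```

Taking the first match relies on the article's own `PMID` appearing before any others in the record, which holds for this sample but would need checking against real responses.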

mbullmanFHCRC commented 5 years ago

@nleguillarme your example also uses citation articleIDs

iacopy commented 4 years ago

I too ran into this.

iacopy commented 4 years ago

@gijswobben @nleguillarme I made a pull request for this issue. It basically follows @mbullmanFHCRC's suggestion.