titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Function `parse_medline_xml` only returns the first affiliation if an author has multiple affiliations. #160

Closed by ZhangWoW123 2 weeks ago

ZhangWoW123 commented 3 weeks ago

Hi team,

Describe the bug: I ran into another issue when using the package to extract PubMed affiliation information from XML files. When an author has multiple affiliations, the parse_medline_xml function extracts only the first one.

To Reproduce: An example of this issue is PMID 39029952. In the XML file, the author section is structured as follows. Each author has multiple affiliations.

<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>Kim</LastName>
    <ForeName>Jennifer</ForeName>
    <Initials>J</Initials>
    <AffiliationInfo>
      <Affiliation>Graduate Program in Neuroscience, University of British Columbia, Vancouver, Canada.</Affiliation>
    </AffiliationInfo>
    <AffiliationInfo>
      <Affiliation>Djavad Mowafaghian Centre for Brain Health, Vancouver, Canada.</Affiliation>
    </AffiliationInfo>
  </Author>
  ...
</AuthorList>

medline_parser.parse_author_affiliation uses author.find("AffiliationInfo/Affiliation") to find the affiliation information. However, find returns only one object (the first matching element), so only the first affiliation is returned.
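To make the difference between find and findall concrete, here is a minimal sketch run against a trimmed copy of the author record above. It assumes the standard-library xml.etree.ElementTree so that it is self-contained; the package itself may use a different XML library, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <Author> record from PMID 39029952 shown above.
xml_snippet = """
<Author ValidYN="Y">
  <LastName>Kim</LastName>
  <ForeName>Jennifer</ForeName>
  <Initials>J</Initials>
  <AffiliationInfo>
    <Affiliation>Graduate Program in Neuroscience, University of British Columbia, Vancouver, Canada.</Affiliation>
  </AffiliationInfo>
  <AffiliationInfo>
    <Affiliation>Djavad Mowafaghian Centre for Brain Health, Vancouver, Canada.</Affiliation>
  </AffiliationInfo>
</Author>
"""
author = ET.fromstring(xml_snippet)

# find() stops at the first match, so the second affiliation is never seen.
first_only = author.find("AffiliationInfo/Affiliation")

# findall() returns a list with every matching element, in document order.
all_matches = author.findall("AffiliationInfo/Affiliation")

print(first_only.text)
print([element.text for element in all_matches])
```

The first print shows only the Graduate Program affiliation, while the second shows both.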

Expected behavior: I expect the parser to return all of an author's affiliations as a list. Might consider replacing author.find("AffiliationInfo/Affiliation").text with list(chain(*([c.text] for c in author.findall("AffiliationInfo/Affiliation"))))?
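As a rough sketch of that idea, the chain expression can also be written as a plain list comprehension, since each inner list holds exactly one string. The helper name collect_affiliations is hypothetical and not part of pubmed_parser, and the example again assumes the standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def collect_affiliations(author):
    """Hypothetical helper: return every affiliation string of one <Author> element."""
    return [
        affiliation.text.strip()
        for affiliation in author.findall("AffiliationInfo/Affiliation")
        if affiliation.text  # skip empty <Affiliation/> elements
    ]


author = ET.fromstring(
    "<Author>"
    "<AffiliationInfo><Affiliation>Graduate Program in Neuroscience, "
    "University of British Columbia, Vancouver, Canada.</Affiliation></AffiliationInfo>"
    "<AffiliationInfo><Affiliation>Djavad Mowafaghian Centre for Brain Health, "
    "Vancouver, Canada.</Affiliation></AffiliationInfo>"
    "</Author>"
)
affiliations = collect_affiliations(author)

# An author without any <AffiliationInfo> yields an empty list.
no_affiliations = collect_affiliations(
    ET.fromstring("<Author><LastName>Kim</LastName></Author>")
)
```

One side effect of this shape is that an author with no affiliation comes back as an empty list, whereas calling .text on the result of find would raise an AttributeError when nothing matches. Whether the parser should return a list or a joined string is a design decision for the maintainers.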

XML file example: pmid_39029952.txt

Thank you all for the great support.

Michael-E-Rose commented 2 weeks ago

I expect the parser to return all of an author's affiliations as a list. Might consider replacing author.find("AffiliationInfo/Affiliation").text with list(chain(*([c.text] for c in author.findall("AffiliationInfo/Affiliation"))))?

This looks like the solution to me. Do you want to provide a PR? Ideally with an additional test case in https://github.com/titipata/pubmed_parser/blob/master/tests/test_medline_parser.py#L33.

ZhangWoW123 commented 2 weeks ago

@Michael-E-Rose,

Sure, I created the PR https://github.com/titipata/pubmed_parser/pull/162

Michael-E-Rose commented 2 weeks ago

Thank you for your service!