titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License
584 stars, 167 forks

Can't extract article title from nested XML file #158

Open ZhangWoW123 opened 4 days ago

ZhangWoW123 commented 4 days ago

Thank you for developing and maintaining the pubmed_parser package. It is a great help for many PubMed-related analyses.

Describe the bug I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.

To Reproduce An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:

<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>

When using medline_parser.parse_article_info, it calls the utils.stringify_children function, which only extracts the text of the node itself and of its direct children. Since the title text sits two levels down, inside the second nested <b>, the parsed title is empty.
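The behaviour described above can be reproduced without the package. The following is a minimal sketch using only the standard library. `children_text` is a hypothetical stand-in modelled on the description in this issue (node text plus the text and tail of direct children), not the library's actual code, and the XML is compacted so that no whitespace-only text nodes get in the way.

```python
import xml.etree.ElementTree as ET


def children_text(node):
    """Join the node's own text with the text and tail of its direct children only."""
    parts = [node.text]
    for child in node:
        parts.extend([child.text, child.tail])
    parts.append(node.tail)
    # Drop the None entries left by elements without text or tail.
    return "".join(filter(None, parts))


# Title of PMID 39029957, with the pretty-printing whitespace removed.
title = ET.fromstring(
    "<ArticleTitle><b><b>OKN-007 is an Effective Anticancer Therapeutic Agent "
    "Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer."
    "</b></b></ArticleTitle>"
)

# The outer <b> carries no text of its own, so nothing is collected.
print(repr(children_text(title)))  # ''
```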

Here is the code being executed:

import pandas as pd
import pubmed_parser as pp

filename = 'pubmed24n1476.xml.gz'

parsed_articles = pp.parse_medline_xml(
    filename,
    year_info_only=True,
    nlm_category=True,
    author_list=True
)
df = pd.DataFrame.from_dict(parsed_articles)
df[df["pmid"] == 39029957]

The XML for this PMID is in the file pubmed24n1476.xml.gz, which can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

Expected behavior I expect the function to extract the correct title from the XML. A temporary solution is to modify the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure whether this will cause other issues.
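A minimal sketch of what the workaround does, assuming that the standard library's `itertext()` is an acceptable stand-in for the lxml `.//text()` XPath (both walk every text node below the element in document order). `descendants_text` is a hypothetical name used for illustration, not a function of the package.

```python
import xml.etree.ElementTree as ET


def descendants_text(node):
    """Join every text node below `node`, at any depth, and trim outer whitespace."""
    return "".join(node.itertext()).strip()


# The <ArticleTitle> of PMID 39029957 as shown in this issue, including indentation.
source = """<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>"""

title = descendants_text(ET.fromstring(source))
print(title)
```

Note that `.strip()` only trims the two ends of the joined string. Whitespace between nested elements in the middle of a title would be kept as-is.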

Screenshots Here is the screenshot for the XML source file.

[Screenshot 2024-10-23 at 10 55 55 PM]

nils-herrmann commented 4 days ago

Thank you for the thorough documentation of the bug @ZhangWoW123 !

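The current implementation and the proposed solution for stringify_children can both be found on Stack Overflow. Although the two approaches give similar results, there are minor differences:

  • The current implementation only extracts text from the children of the node
  • The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().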

Michael-E-Rose commented 4 days ago

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

I wonder whether that's necessary. Can we not generalize parse_article_info() to include one more level? What would it break elsewhere?

nils-herrmann commented 4 days ago

@Michael-E-Rose We could use stringify_descendants() for the other fields in parse_article_info(). In parse_pubmed_caption(), however, the new function would parse not only the fig_caption but also the fig_list-items (which we don't want):

<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4&#x02010;IgG1 alleviates
      lung injury and improves survival of SARS&#x02010;CoV&#x02010;2 gamma
      variant infected K18&#x02010;hACE2 transgenic mice</title>
  <p>
      <list list-type="simple" id="emmm202216109-list-0002">
          <list-item id="emmm202216109-li-0004">
              <label>A</label>
              <p>K18&#x02010;hACE2 transgenic mice were inoculated with
...
</caption>
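
The difference can be illustrated with a shortened, hypothetical caption. The list-item text below is a placeholder, not the elided original. Both helpers are stand-ins written for this sketch with the standard library, and applying them to each child of `<caption>` is an assumption about how the caption parser works.

```python
import xml.etree.ElementTree as ET


def children_text(node):
    """Children-only strategy: node text plus text and tail of direct children."""
    parts = [node.text]
    for child in node:
        parts.extend([child.text, child.tail])
    parts.append(node.tail)
    return "".join(filter(None, parts))


def descendants_text(node):
    """Descendants strategy: every text node below `node`, at any depth."""
    return "".join(node.itertext()).strip()


# Shortened caption; "placeholder panel text" is invented for this sketch.
caption = ET.fromstring(
    "<caption>"
    "<title>Aerosol delivery of sACE2<sub>2</sub>.v2.4</title>"
    "<p><list><list-item><label>A</label>"
    "<p>placeholder panel text</p></list-item></list></p>"
    "</caption>"
)
title, paragraph = list(caption)

# One level of inline markup: both strategies agree.
print(children_text(title) == descendants_text(title))  # True

# The nested list: children-only skips it, descendants pulls in label and item text.
print(repr(children_text(paragraph)))     # ''
print(repr(descendants_text(paragraph)))  # 'Aplaceholder panel text'
```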

ZhangWoW123 commented 4 days ago

Thank you for the thorough documentation of the bug @ZhangWoW123 !

The current implementation and the proposed solution for stringify_children can both be found on Stack Overflow. Although the two approaches give similar results, there are minor differences:

  • The current implementation only extracts text from the children of the node
  • The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

Thank you so much @nils-herrmann for the great help! Looking forward to the updated package:)

Michael-E-Rose commented 12 hours ago

I still don't get it. Why can we not change stringify_children()? What would break?

Given the complexity of the current codebase, not adding a new function is strictly preferable.

nils-herrmann commented 8 hours ago

As seen above, parse_pubmed_caption() would break because it would then parse not only the text of the children (i.e. the <title> text) but also the text of deeper descendants (i.e. the <list-item> contents).

Michael-E-Rose commented 8 hours ago

I thought the problem was that parse_article_info() parses too little, not too much?

Let me ask the other way around: @ZhangWoW123 suggested this:

A temporary solution is to modify the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure whether this will cause other issues.

What other issues may it cause, and can we prevent them by changing existing functions?
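
One way to approach this question empirically is to run both strategies over sample documents and list the elements where they disagree. The following is a minimal sketch under stated assumptions: `children_text`, `descendants_text` and `differing_tags` are hypothetical helpers built on the standard library rather than on the package, and the children-only variant is assumed to append the element's own tail.

```python
import xml.etree.ElementTree as ET


def children_text(node):
    """Children-only strategy (assumed to include the node's own tail)."""
    parts = [node.text]
    for child in node:
        parts.extend([child.text, child.tail])
    parts.append(node.tail)
    return "".join(filter(None, parts))


def descendants_text(node):
    """Descendants strategy: all text below `node`, outer whitespace trimmed."""
    return "".join(node.itertext()).strip()


def differing_tags(root):
    """Return the tags of all elements for which the two strategies disagree."""
    return [
        el.tag
        for el in root.iter()
        if children_text(el).strip() != descendants_text(el)
    ]


# Tiny invented sample: one doubly nested title, one field with flat inline markup.
doc = ET.fromstring(
    "<doc>"
    "<ArticleTitle><b><b>Nested title.</b></b></ArticleTitle>"
    "<AbstractText>Flat <i>inline</i> text.</AbstractText>"
    "</doc>"
)

print(differing_tags(doc))  # ['doc', 'ArticleTitle', 'i']
```

In this sample, `ArticleTitle` differs because of the nesting depth, while `i` differs only because the children-only sketch also appends the element's own tail (" text."). Fields with a single level of inline markup, such as `AbstractText` here, come out identical.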