Can't extract article title from nested XML file

ZhangWoW123 commented 1 month ago

Thank you for developing and maintaining the pubmed_parser package. It is a great help for many PubMed-related analyses.

Describe the bug I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.

To Reproduce An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:

<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>

When using medline_parser.parse_article_info, it calls the utils.stringify_children function, which only extracts the current layer and first layer of children. Since the title is within the second layer, the parsed title is empty.
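The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the package's code: children_only_text is a hypothetical stand-in that approximates a children-only extraction, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET
from itertools import chain

XML = """<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>"""


def children_only_text(node):
    """Hypothetical helper: text of the node and of its direct children only."""
    parts = (
        [node.text]
        + list(chain(*([child.text, child.tail] for child in node)))
        + [node.tail]
    )
    return "".join(filter(None, parts))


title_node = ET.fromstring(XML)
# Only whitespace is found at these two levels, so nothing is left after strip()
shallow_title = children_only_text(title_node).strip()
print(repr(shallow_title))
```

The title text sits in the inner <b> element, one level below anything this helper looks at, which is why the stripped result is empty.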

Here is the code being executed:

import pandas as pd
import pubmed_parser as pp

filename = 'pubmed24n1476.xml.gz'

parsed_articles = pp.parse_medline_xml(
    filename,
    year_info_only=True,
    nlm_category=True,
    author_list=True
)
df = pd.DataFrame.from_dict(parsed_articles)
df[df["pmid"] == 39029957]

The XML record for this PMID is in the file pubmed24n1476.xml.gz, which can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

Expected behavior I expect the function to extract the correct title from the XML. A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.
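As a rough illustration of the workaround, here is a sketch that collects the text of all descendants. It assumes the standard library's ElementTree, where itertext() is used as an analogue of the lxml call xpath('.//text()'); the helper name all_descendant_text is made up for this example and is not part of pubmed_parser.

```python
import xml.etree.ElementTree as ET

XML = """<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>"""


def all_descendant_text(node):
    """Hypothetical helper: join every text node below `node`, at any depth."""
    return "".join(node.itertext()).strip()


title = all_descendant_text(ET.fromstring(XML))
print(title)
```

Because the traversal has no depth limit, the doubly nested <b> no longer hides the title. The same property means it also collects text from any other nested elements.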

Screenshots Here is the screenshot for the XML source file.

nils-herrmann commented 1 month ago

Thank you for the thorough documentation of the bug @ZhangWoW123 !

Both the current implementation of stringify_children and the proposed solution can be found on Stack Overflow. Although both approaches give similar results, there are minor differences:

The current implementation only extracts text from the children of the node
The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

Michael-E-Rose commented 1 month ago

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

I wonder whether that's necessary. Can we not generalize parse_article_info() to include one more level? What would it break elsewhere?

nils-herrmann commented 1 month ago

@Michael-E-Rose We could use stringify_descendants() for the other fields in parse_article_info(). In parse_pubmed_caption(), however, the new function would parse not only the fig_caption but also the fig_list-items (which we don't want):

<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4&#x02010;IgG1 alleviates
      lung injury and improves survival of SARS&#x02010;CoV&#x02010;2 gamma
      variant infected K18&#x02010;hACE2 transgenic mice</title>
  <p>
      <list list-type="simple" id="emmm202216109-list-0002">
          <list-item id="emmm202216109-li-0004">
              <label>A</label>
              <p>K18&#x02010;hACE2 transgenic mice were inoculated with
...
</caption>
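To make the difference concrete, here is a small sketch on a simplified caption. The XML text is placeholder content rather than the real caption, both helpers are hypothetical stand-ins for the two approaches, and the standard library's ElementTree is used instead of lxml. Each helper is applied to every direct child of <caption>.

```python
import xml.etree.ElementTree as ET
from itertools import chain

# Placeholder caption: a title with inline markup, then a list of panel items
CAPTION = (
    "<caption>"
    "<title>Placeholder title with sub<sub>2</sub>script</title>"
    "<p><list><list-item><label>A</label>"
    "<p>Placeholder panel text.</p></list-item></list></p>"
    "</caption>"
)


def children_only_text(node):
    """Hypothetical: text of the node and its direct children only."""
    parts = (
        [node.text]
        + list(chain(*([child.text, child.tail] for child in node)))
        + [node.tail]
    )
    return "".join(filter(None, parts))


def descendants_text(node):
    """Hypothetical: text of the node and of everything nested below it."""
    return "".join(node.itertext())


caption = ET.fromstring(CAPTION)
shallow = [children_only_text(child) for child in caption]
deep = [descendants_text(child) for child in caption]
print(shallow)  # title text only; the <p> holding the list contributes nothing
print(deep)     # the <p> entry now also contains the list-item text
```

With the children-only helper the <p> wrapper yields an empty string, whereas the descendants helper pulls the label and panel text into the caption.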

ZhangWoW123 commented 1 month ago

Thank you for the thorough documentation of the bug @ZhangWoW123 !

Both the current implementation of stringify_children and the proposed solution can be found on Stack Overflow. Although both approaches give similar results, there are minor differences:

The current implementation only extracts text from the children of the node

The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

Thank you so much @nils-herrmann for the great help! Looking forward to the updated package:)

Michael-E-Rose commented 1 month ago

I still don't get it. Why can we not change stringify_children()? What would break?

Given the complexity of the current codebase, not adding a new function is strictly preferable.

nils-herrmann commented 1 month ago

As seen above, parse_pubmed_caption() would break because it would parse not only the children's text (i.e. the <title> text) but also the descendants' text (i.e. the <list-item> text).

Michael-E-Rose commented 1 month ago

I thought the problem was that parse_article_info() doesn't parse enough, not that it parses too much?

Let me ask the other way around: @ZhangWoW123 suggested this:

A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.

What other issues may it cause, and can we prevent them by changing existing functions?

nils-herrmann commented 2 weeks ago

Original problem: parse_article_info() does not parse enough. The reason is that stringify_children() only gets the text of the children.
Proposed solution: Use ''.join(root.xpath('.//text()')).strip() in stringify_children() which gets the text of all descendants.
Problem with the proposed solution: In parse_pubmed_caption() we are interested in getting the text of the children, not the descendants, i.e. the proposed solution gets too much text.
We need two functions because we want two different things: Getting text of children or getting the text of the descendants.

Michael-E-Rose commented 2 weeks ago

When would I want only the children, but not the other descendants? How often does it actually happen that there are both children and deeper descendants?

If I understand the example from OP correctly, then the nested title is kind of an anomaly.

nils-herrmann commented 2 weeks ago

We want only children in parse_pubmed_caption() because we parse the caption title (children) separately from the caption list items (descendants). Besides that case we always want the descendants. We can change the code to parse only the caption title and use stringify_descendants() for that function too.
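A possible shape for that change, as a sketch only: the caption title is selected explicitly and the list items are collected separately, so one descendant-based helper can serve both. The XML is placeholder content, stringify_descendants here is a hypothetical implementation of the proposed function, and the standard library's ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

# Placeholder caption with nested markup in the title and one list item
CAPTION = (
    "<caption>"
    "<title>Placeholder title with <italic>nested <bold>markup</bold></italic></title>"
    "<p><list><list-item><label>A</label>"
    "<p>Placeholder panel text.</p></list-item></list></p>"
    "</caption>"
)


def stringify_descendants(node):
    """Hypothetical helper: all text below `node`, or '' if the node is missing."""
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


caption = ET.fromstring(CAPTION)

# Restrict the caption text to <title>, so list items cannot leak into it
fig_caption = stringify_descendants(caption.find("title"))

# Collect list items separately as (label, text) pairs
fig_list_items = [
    (item.findtext("label", default=""), stringify_descendants(item.find("p")))
    for item in caption.iter("list-item")
]
print(fig_caption)
print(fig_list_items)
```

Selecting <title> first keeps the caption text limited to the title, while nested inline markup inside the title is still captured in full.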

Yes, the nested title is an anomaly.

titipata / pubmed_parser

Can't extract article title from nested XML file #158