Open ZhangWoW123 opened 1 month ago
> To avoid breaking other functions like parse_pubmed_caption() let's create a new function stringify_descendants().
I wonder whether that's necessary. Can we not generalize parse_article_info() to include one more level? What would it break elsewhere?
@Michael-E-Rose We could use stringify_descendants() for the other fields in parse_article_info(). In parse_pubmed_caption() the new function would parse not only the fig_caption but also the fig_list-items (which we don't want):
<caption>
<title>Aerosol delivery of sACE2<sub>2</sub>.v2.4‐IgG1 alleviates
lung injury and improves survival of SARS‐CoV‐2 gamma
variant infected K18‐hACE2 transgenic mice</title>
<p>
<list list-type="simple" id="emmm202216109-list-0002">
<list-item id="emmm202216109-li-0004">
<label>A</label>
<p>K18‐hACE2 transgenic mice were inoculated with
...
</caption>
Thank you for the thorough documentation of the bug @ZhangWoW123!
The current implementation and the proposed solution for stringify_children can both be found on Stack Overflow. Although both approaches give similar results, there are minor differences:
- The current implementation only extracts text from the children of the node
- The proposed solution extracts text from all descendants of the node
To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().
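To make the difference between the two bullets concrete, here is a minimal sketch using only the standard library. The helper names children_text and descendants_text are hypothetical and written for this illustration; they follow the behavior described in this thread and are not copied from pubmed_parser. The XML snippet is invented too, and only mimics a title nested two levels deep. ElementTree's itertext() stands in for the lxml expression root.xpath('.//text()').

```python
import xml.etree.ElementTree as ET


def children_text(node):
    """Text of the node and its direct children only (hypothetical helper)."""
    parts = [node.text]
    for child in node:
        # Only the child's own text and tail; grandchildren are never visited.
        parts.extend([child.text, child.tail])
    return "".join(filter(None, parts))


def descendants_text(node):
    """Text of the node and all of its descendants (hypothetical helper)."""
    return "".join(node.itertext()).strip()


# Invented example: the actual title text sits in the second layer (<i> inside <b>).
title = ET.fromstring("<ArticleTitle><b><i>Deep</i> title</b></ArticleTitle>")

print(repr(children_text(title)))     # children only: the nested text is missed
print(repr(descendants_text(title)))  # all descendants: the full title is found
```

With this input the children-only variant returns an empty string, which matches the symptom reported in the issue, while the descendants variant returns the whole title.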
Thank you so much @nils-herrmann for the great help! Looking forward to the updated package :)
I still don't get it. Why can we not change stringify_children()? What would break?
Given the complexity of the current codebase, not adding a new function would be strictly preferable.
As seen above, parse_pubmed_caption() would break because it would parse not only the text of the children (i.e. the <title> text) but also the text of the deeper descendants (i.e. the <list-item> elements).
I thought the problem was that parse_article_info() doesn't parse enough, not that it parses too much?
Let me ask the other way around: @ZhangWoW123 suggested this:
> A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.
What other issues may it cause, and can we prevent them by changing existing functions?
Original problem: parse_article_info() does not parse enough. The reason is that stringify_children() only gets the text of the children.
Proposed solution: Use ''.join(root.xpath('.//text()')).strip() in stringify_children(), which gets the text of all descendants.
Problem with the proposed solution: In parse_pubmed_caption() we are interested in the text of the children, not of the descendants, i.e. the proposed solution gets too much text.
We need two functions because we want two different things: getting the text of the children, or getting the text of the descendants.
When would I want only the children, but not the other descendants? How often does it happen actually that there are children and descendants?
If I understand the example from OP correctly, then the nested title is kind of an anomaly.
We want only children in parse_pubmed_caption() because we parse the caption title (children) separately from the caption list items (descendants). Besides that case we always want the descendants.
We can change the code to parse only the caption title and use stringify_descendants() for that function too.
Yes, the nested title is an anomaly.
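The idea of parsing only the caption title could look roughly like the sketch below. It is an illustration under assumptions, not the package's code: descendants_text is a hypothetical stand-in for the proposed stringify_descendants(), and the caption is a shortened, partly invented version of the example above (the list-item text "Panel A description" is made up).

```python
import xml.etree.ElementTree as ET

# Shortened, partly invented caption modeled on the example in this thread.
CAPTION_XML = """
<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4-IgG1</title>
  <p>
    <list list-type="simple">
      <list-item><label>A</label><p>Panel A description</p></list-item>
    </list>
  </p>
</caption>
"""


def descendants_text(node):
    """Text of the node and all of its descendants (hypothetical helper)."""
    return "".join(node.itertext()).strip()


caption = ET.fromstring(CAPTION_XML.strip())

# Applied to the whole <caption>, the list items are swept in as well.
whole_caption = descendants_text(caption)

# Applied to the <title> child only, nested markup such as <sub> is kept
# while the list items stay out.
title_only = descendants_text(caption.find("title"))

print(title_only)
```

Selecting the <title> element first keeps the descendants-based helper usable for captions, so a single text-extraction function might be enough; whether that holds for every caption layout in the corpus is not something this sketch can confirm.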
Thank you for developing and maintaining the pubmed_parser package. This is a great help to many PubMed-related analyses.
Describe the bug
I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.
To Reproduce
An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:
When using medline_parser.parse_article_info, it calls the utils.stringify_children function, which only extracts the current layer and the first layer of children. Since the title is within the second layer, the parsed title is empty.
Here is the code being executed:
The XML file for this PMID is in the pubmed24n1476.xml.gz file and can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
Expected behavior
I expect the function to extract the correct title from the XML. A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.
Screenshots
Here is the screenshot for the XML source file.