titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License
588 stars 168 forks source link

Drop all_paragraph from parse_pubmed_paragraph() #147

Closed mattyding closed 3 months ago

mattyding commented 4 months ago

Describe the bug We find that pubmed_parser.parse_pubmed_paragraph frequently omits paragraphs that are present in the XML representation. This error has occurred with every XML file that we have tested, and we have seen examples where as many as 14 paragraphs are missing from the generated output.

To Reproduce Attached to this bug report is the XML file for PMC example ID pmc-PMC548513 (in txt format because github doesn't accept xml uploads). Rename the extension to ".xml" and run the following minimal code snippet.

import pubmed_parser  # version 0.4.0

with open("PMC548513.xml") as f:
    xml_str = f.read()

paragraphs = pubmed_parser.parse_pubmed_paragraph(xml_str)
text = ""
current_section_title = ""
if not paragraphs:
    return None
for paragraph in paragraphs:
    section_title = paragraph["section"]
    if section_title != current_section_title and section_title.strip():
        text += "\n\n## " + section_title
        current_section_title = section_title
    text += "\n\n" + paragraph["text"].strip()

print(text)

The resulting output misses the following paragraphs (you can compare to source here):

This example has 14 missing paragraphs, with at least one missing from each section. Such errors are present in every XML files that we've tested. You can reproduce with any such file.

Expected behavior The function should not omit paragraphs from the data.

Screenshots

Dependencies pubmed_parser version version 0.4.0

Additional context

nils-herrmann commented 3 months ago

Dear @mattyding, thank you for your message. The missing paragraphs have no references and are therefore ignored. Use the all_paragraph=True argument as described in the documentation. Please let us know if this solves your problem.

Kind regards

Michael-E-Rose commented 3 months ago

Why is it the default, to not parse all paragraphs, @titipata ? The documentation says "to aviod noisy parsed text", but what is that?

In any case, we should update the documentation. This is not proper English:

A boolean indicating if you want to include paragraph with no references made or not if True, include all paragraphs if False, include only paragraphs that have references default: False

titipata commented 3 months ago

Agree @Michael-E-Rose. I think we should definitely rewrite the documentation! I think I wrote it a while ago and it does not make sense!

Michael-E-Rose commented 3 months ago

But what was the reading behind all_paragraph=False?

mattyding commented 3 months ago

Thanks for the response. I missed that part of the documentation. I personally think it is more intuitive to parse all paragraphs by default and to opt-in to skipping paragraphs without references.

Michael-E-Rose commented 3 months ago

Do you think there are use cases where users would want to skip paragraphs without references? What are paragraphs without references anysways?

Michael-E-Rose commented 3 months ago

Alright, then let's simply drop this option and always parse all the paragraphs. I can't think of a use-case where fewer paragraphs are desired, while the reasons for all_paragraph=False are lost in history. @nils-herrmann , would you please do that?