Closed mattyding closed 3 months ago
Dear @mattyding, thank you for your message. The missing paragraphs have no references and are therefore ignored. Use the all_paragraph=True argument as described in the documentation. Please let us know if this solves your problem.
Kind regards
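To make "paragraphs without references" concrete, here is a rough, standard-library-only sketch of the filtering behaviour described above. It is not pubmed_parser's actual code: the sample XML, the helper name extract_paragraphs, and the returned keys are assumptions made for illustration, and only the all_paragraph flag name is taken from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an article body; not taken from PMC548513.
SAMPLE_XML = """
<article>
  <body>
    <sec>
      <title>Background</title>
      <p>Earlier work reported this effect <xref ref-type="bibr" rid="B1">1</xref>.</p>
      <p>This paragraph cites nothing.</p>
    </sec>
  </body>
</article>
""".strip()


def extract_paragraphs(xml_text, all_paragraph=False):
    """Return paragraphs; skip those without citations unless all_paragraph is True."""
    root = ET.fromstring(xml_text)
    results = []
    for p in root.iter("p"):
        # Collect the targets of cross-references found inside this paragraph.
        ref_ids = [x.get("rid") for x in p.iter("xref") if x.get("rid")]
        if ref_ids or all_paragraph:
            # Flatten nested text and normalise whitespace.
            text = " ".join("".join(p.itertext()).split())
            results.append({"text": text, "reference_ids": ref_ids})
    return results


default_result = extract_paragraphs(SAMPLE_XML)
full_result = extract_paragraphs(SAMPLE_XML, all_paragraph=True)
print(len(default_result), len(full_result))  # prints "1 2"
```

With the default, the second paragraph is dropped because it contains no cross-reference. With all_paragraph=True it is kept, with an empty reference list.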
Why is the default to not parse all paragraphs, @titipata? The documentation says "to aviod noisy parsed text", but what does that mean?
In any case, we should update the documentation. This is not proper English:
A boolean indicating if you want to include paragraph with no references made or not if True, include all paragraphs if False, include only paragraphs that have references default: False
Agree @Michael-E-Rose. I think we should definitely rewrite the documentation! I think I wrote it a while ago and it does not make sense!
But what was the reasoning behind all_paragraph=False?
Thanks for the response. I missed that part of the documentation. I personally think it is more intuitive to parse all paragraphs by default and to opt in to skipping paragraphs without references.
Do you think there are use cases where users would want to skip paragraphs without references? What are paragraphs without references anyway?
Alright, then let's simply drop this option and always parse all the paragraphs. I can't think of a use case where fewer paragraphs are desired, while the reasons for all_paragraph=False are lost in history.
@nils-herrmann, would you please do that?
Describe the bug We find that pubmed_parser.parse_pubmed_paragraph frequently omits paragraphs that are present in the XML representation. This error has occurred with every XML file that we have tested, and we have seen examples where as many as 14 paragraphs are missing from the generated output.
To Reproduce Attached to this bug report is the XML file for PMC example ID pmc-PMC548513 (in txt format because GitHub doesn't accept XML uploads). Rename the extension to ".xml" and run the following minimal code snippet.
The resulting output misses the following paragraphs (you can compare to source here):
This example has 14 missing paragraphs, with at least one missing from each section. Such errors are present in every XML file that we've tested. You can reproduce with any such file.
Expected behavior The function should not omit paragraphs from the data.
Dependencies pubmed_parser version 0.4.0