ValueError when attempting to parse OA XML

mazzespazze commented 1 year ago

Describe the bug: I downloaded the gzipped tar archive "oa_comm_xml.incr.2023-06-20.tar.gz" from https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/.

Full link: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/oa_comm_xml.incr.2023-06-20.tar.gz.

Python code:

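import tarfile
import pubmed_parser as pp

# fileobj: file object for the downloaded .tar.gz (opened elsewhere)
tar = tarfile.open(fileobj=fileobj)
for i, member in enumerate(tar.getmembers()):
    f = tar.extractfile(member)
    stream = ""
    if f is not None:
        try:
            # decode the member to a str; this is what later triggers the ValueError
            content = f.read().decode("utf-8")
            stream += content
        except UnicodeError:
            continue
    pmc_dict = pp.parse_pubmed_xml(stream)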

Error:

    tree = etree.fromstring(path)
  File "src/lxml/etree.pyx", line 3254, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1908, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

To Reproduce: Parse the members of the archive linked above with parse_pubmed_xml.

Expected behavior: I was expecting a dictionary, as in the other cases.


Dependencies: The ones required by this package, plus tarfile and gzip.

Additional context: I want to parse each XML file in https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/

raypereda-gr commented 1 year ago

Thanks for the details for reproducing the problem with code and data. Not all of the files in the …/xml/ folder are the same: some begin with an XML declaration and some do not.

head -2 *.xml
==> PMC9933422.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article

==> PMC9942033.xml <==
<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with MathML3 v1.3 20210610//EN" "JATS-archivearticle1-3-mathml3.dtd">

I recommend un-tarring and then iterating over the XML files produced. The file inside the tar file that triggers the error is PMC9933422.xml, the one that begins with an XML declaration.

# import json
import pubmed_parser as pp

# treat this file as an XML fragment (it has no XML declaration)
filename = 'PMC9942033.xml'
with open(filename, "rb") as file:
    content = file.read().decode("utf-8")
pmc_dict = pp.parse_pubmed_xml(content)
first_element = next(iter(pmc_dict))
print(f'{pmc_dict[first_element]}')
# output: Tuberculosis in older adults: case studies from four countries with rapidly ageing populations in the western pacific region
# print(json.dumps(pmc_dict, indent=4))

# treat this file as a complete XML file
filename = 'PMC9933422.xml'
pmc_dict = pp.parse_pubmed_xml(filename)
first_element = next(iter(pmc_dict))
print(f'{pmc_dict[first_element]}')
# output: Interventions for myopia control in children: a living systematic review and network meta‐analysis

Hopefully this helps.
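
The two files above differ in the leading XML declaration, and the error message asks for "XML fragments without declaration". A minimal sketch of removing that declaration from an already decoded string is below. The helper name strip_xml_declaration is hypothetical, and whether parse_pubmed_xml then accepts the result the same way it accepts the fragment example is an assumption, not something verified here.

```python
import re

# Matches an optional BOM, leading whitespace, and an XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?> at the very start of a string.
# The "\s" after "xml" keeps other processing instructions
# (for example <?xml-stylesheet ...?>) from matching.
_XML_DECLARATION = re.compile(r"^\ufeff?\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(text: str) -> str:
    """Return `text` without a leading XML declaration, if it has one."""
    return _XML_DECLARATION.sub("", text, count=1)
```

In the loop from the original report, this would be applied to the decoded member before parsing, for example pp.parse_pubmed_xml(strip_xml_declaration(content)).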

mazzespazze commented 1 year ago

I am now using a workaround: when parsing raises an exception, I write the full XML to a file and parse it later.

I cannot really afford to "unpack" all the tar files because of space constraints. Is there a way to pass a file to the parser while it is still inside the tar.gz?

raypereda-gr commented 1 year ago

I see how space constraints are driving your approach. The in-memory members from the archive act differently from an XML file on disk.

See here for an example of unarchiving.
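
One way to keep disk usage to a single article at a time is to extract each member to a temporary file, parse it by path, and delete it before moving on. The sketch below illustrates that idea under stated assumptions: the function name parse_members_one_at_a_time and its parse parameter are hypothetical, and parse stands in for any function that takes a file path (pp.parse_pubmed_xml was called with a path earlier in this thread). It has not been run against the full oa_comm archives.

```python
import os
import tarfile
import tempfile
from typing import Callable, Iterator, Tuple


def parse_members_one_at_a_time(
    archive_path: str,
    parse: Callable[[str], dict],
) -> Iterator[Tuple[str, dict]]:
    """Yield (member name, parsed result) for each .xml member of a tar archive.

    `parse` is any function that takes a file path and returns a dict.
    """
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz) itself.
    with tarfile.open(archive_path, mode="r:*") as tar:
        for member in tar:  # iterate lazily instead of calling getmembers()
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            # Write the raw bytes to a temporary file so the parser sees an
            # ordinary XML file, with or without an XML declaration.
            fd, tmp_path = tempfile.mkstemp(suffix=".xml")
            try:
                with os.fdopen(fd, "wb") as tmp:
                    tmp.write(extracted.read())
                yield member.name, parse(tmp_path)
            finally:
                # Remove the temporary file before the next member is handled.
                os.remove(tmp_path)
```

Usage would look like: for name, pmc_dict in parse_members_one_at_a_time("oa_comm_xml.incr.2023-06-20.tar.gz", pp.parse_pubmed_xml): ... Only one extracted article exists on disk at any moment, at the cost of one small write per member.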

titipata / pubmed_parser

ValueError when attempting to parse OA XML #126