kbrbe / beltrans-data-integration

Creating a FAIR Linked Data corpus for the BELTRANS research project about Belgian book translations NL-FR and FR-NL between 1970 and 2020
https://www.kbr.be/en/projects/beltrans/
MIT License

Parsing XML in a streaming fashion leads to unexpected results when clearing the root to save RAM #274

Closed SvenLieber closed 2 months ago

SvenLieber commented 2 months ago

When iterating in a streaming fashion over an XML file with lxml.iterparse, we clear each element after processing it to save RAM. We also clear the root element, because we still ran out of RAM when processing huge files (see also https://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory).
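For reference, a minimal sketch of the pattern described above, assuming a placeholder input file and a hypothetical processRecord function (not the actual BELTRANS code):

    from lxml import etree

    MARC_NS = 'http://www.loc.gov/MARC21/slim'

    context = etree.iterparse('records.xml', events=('end',),
                              tag='{' + MARC_NS + '}record')

    for event, record in context:
        processRecord(record)  # hypothetical processing function
        record.clear()         # free the processed record
        # additionally clear the root, otherwise references to already
        # processed records accumulate and RAM usage grows for huge files
        record.getroottree().getroot().clear()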

However, after processing a number of records, namespace issues emerge when the root is cleared. Instead of records with a default namespace xmlns="http://www.loc.gov/MARC21/slim" declared on the record element (and no further namespace usage in the datafield and subfield child elements), we get new namespaces with every record. On top of that, in the middle of a record some datafields no longer use the namespace, and the code produces unexpected results. For example, when checking whether marc:datafield[@tag="075"] exists we get True for some records but False for others, because in some records the element is datafield and in others it is ns1234:datafield.

When clearing the root after each record, the namespace declarations in the output look like the following:

    <ns198:record xmlns:ns198="http://www.loc.gov/MARC21/slim">
      <ns198:datafield> .. </datafield>
      <datafield> ... </datafield>
      ...
    </record>

    <ns199:record xmlns:ns199="http://www.loc.gov/MARC21/slim"> ... </record>

Please note how, in the first record, datafields both with and without a namespace prefix occur.

One workaround when looking for elements is to first look for marc:datafield and, if not found, look for datafield (see the sketch below). However, this is highly use-case specific and becomes problematic if more datafield elements with different namespaces are used.
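A rough sketch of that fallback lookup, assuming an lxml record element and the usual MARC namespace mapping (names are illustrative):

    MARC = {'marc': 'http://www.loc.gov/MARC21/slim'}

    # first try the namespaced element, then fall back to the bare name
    fields = record.xpath('marc:datafield[@tag="075"]', namespaces=MARC)
    if not fields:
        fields = record.xpath('datafield[@tag="075"]')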

Another way to fix the issue is to NOT clear the root element. However, then we run into RAM issues for huge files (we have a single collection with many records).

We verified the above with 3 MARC records: all of them are parsed correctly when they are the sole input, but they fail when they are part of a large XML file, where we get many different namespaces after a while. When the root is kept, the parsing error does not occur.

SvenLieber commented 2 months ago

The solution in the Stack Overflow post suggests the following (instead of root.clear()):

        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]

I tested it and it works: correct results and no increasing RAM usage with a large file (the APEP person dump of KBR).
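For reference, a minimal sketch of the full loop with this cleanup in place of root.clear(); the input file and processRecord function are placeholders, not the actual BELTRANS code:

    from lxml import etree

    MARC_NS = 'http://www.loc.gov/MARC21/slim'

    context = etree.iterparse('records.xml', events=('end',),
                              tag='{' + MARC_NS + '}record')

    for event, elem in context:
        processRecord(elem)  # hypothetical processing function
        elem.clear()
        # instead of clearing the root: delete all already-processed
        # preceding siblings of this element and of its ancestors
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context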