Deserialising a subtree from lxml to improve performance of huge documents

skinkie commented 3 years ago

In #476 we established that the performance of huge XSDs in combination with larger XML documents is limited by single-threaded processing. I wonder if the example below could be executed more elegantly, by using the lxml Element directly without serialising it to a string again. Hence, could LxmlSaxHandler or LxmlEventHandler be convinced to start with an Element?

import lxml.etree
from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig
from xsdata.formats.dataclass.parsers.handlers import LxmlEventHandler
from netex import VersionsRelStructure

context = XmlContext()
config = ParserConfig(fail_on_unknown_properties=False)
parser = XmlParser(context=context, config=config, handler=LxmlEventHandler)
tree = lxml.etree.parse('NeTEx_HTM_Rail_2021-06-02_new.xml')
versions = parser.from_bytes(lxml.etree.tostring(tree.find('.//{http://www.netex.org.uk/netex}versions')), VersionsRelStructure)

What I would like to explore is the call below. It seems to me that the existing handlers are already very close to what lxml offers...

versions = parser.from_element(tree.find('.//{http://www.netex.org.uk/netex}versions'), VersionsRelStructure)

tefra commented 3 years ago

So if you know exactly what piece of the xml you actually need, parse it directly :love_you_gesture:

lxml has an API, etree.iterwalk, to turn an element or tree into an event stream.

from typing import Any
from typing import Union

from lxml import etree

from xsdata.formats.dataclass.parsers.handlers import LxmlEventHandler
from xsdata.models.enums import EventType

EVENTS = (EventType.START, EventType.END, EventType.START_NS)

class LxmlTreeWalkerHandler(LxmlEventHandler):
    __slots__ = ()

    def parse(self, source: Union[etree.Element, etree.ElementTree]) -> Any:
        """Parse directly an lxml element or tree."""
        ctx = etree.iterwalk(source, EVENTS)
        return self.process_context(ctx)

# Reuses XmlParser, context and config from the first snippet.
tree = etree.parse(sample)
parser = XmlParser(context=context, config=config, handler=LxmlTreeWalkerHandler)
versions = parser.parse(tree.find('.//{http://www.netex.org.uk/netex}versions'), VersionsRelStructure)
print(versions)

VersionsRelStructure(id=None, modification_set=<ModificationSetEnumeration.ALL: 'all'>, version_ref=[], version=[Version(name_of_class_attribute=None, id='HTM:Version:_2020-10-12', validity_conditions=None, valid_between=[], alternative_texts=None, data_source_ref_attribute=None, created=None, changed=None, modification=<ModificationEnumeration.NEW: 'new'>, version='_2020-10-12', status_attribute=<StatusEnumeration.ACTIVE: 'active'>, derived_from_version_ref_attribute=None, compatible_with_version_frame_version_ref=None, derived_from_object_ref=None, key_list=None, extensions=None, branding_ref=None, responsibility_set_ref_attribute=None, start_date=XmlDateTime(2020, 10, 8, 0, 0, 0, 0, 0), end_date=XmlDateTime(2020, 10, 25, 0, 0, 0, 0, 0), status=None, description=None, version_type=<VersionTypeEnumeration.BASELINE: 'baseline'>, type_of_version_ref=None, derived_from_version_ref=None)])
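
For readers who have not used iterwalk before, here is a minimal sketch of the event stream it yields for a subtree. It assumes lxml is installed. The miniature document, its namespace and its tag names are invented for illustration and have nothing to do with NeTEx. Only start and end events are requested here, whereas the handler above also asks for start-ns.

```python
from lxml import etree

# Hypothetical miniature document; the namespace and tags are invented.
XML = b"""<root xmlns="urn:example">
  <skip><item>ignored</item></skip>
  <versions><version id="v1"/><version id="v2"/></versions>
</root>"""

root = etree.fromstring(XML)
subtree = root.find(".//{urn:example}versions")

# Walk only the selected subtree; nothing is serialised back to bytes.
events = [
    (event, etree.QName(node).localname)
    for event, node in etree.iterwalk(subtree, events=("start", "end"))
]

for event, name in events:
    print(event, name)
```

The walk starts and ends at the element passed in, so the skip branch of the document never produces an event.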

tefra commented 3 years ago

What bugs me a lot is that lxml's iterwalk seems faster than iterparse for the whole document.
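
One way to check that observation is a small timing sketch like the one below. It assumes lxml is installed and uses a synthetic document invented for this purpose, so the numbers say little about real NeTEx files and will vary between machines and lxml versions. The function names are hypothetical helpers, not part of xsdata.

```python
import io
import timeit

from lxml import etree

# Synthetic document, invented for this sketch; real files are far larger.
XML = (
    b"<root>"
    + b"".join(b'<item id="%d"><name>n</name></item>' % i for i in range(2000))
    + b"</root>"
)
EVENTS = ("start", "end")


def count_iterparse() -> int:
    """Count events emitted while the bytes are being parsed."""
    return sum(1 for _ in etree.iterparse(io.BytesIO(XML), events=EVENTS))


def count_iterwalk() -> int:
    """Build the full tree first, then count events while walking it."""
    return sum(1 for _ in etree.iterwalk(etree.fromstring(XML), events=EVENTS))


# Both approaches should see the same events before we compare timings.
assert count_iterparse() == count_iterwalk()

for fn in (count_iterparse, count_iterwalk):
    print(fn.__name__, timeit.timeit(fn, number=20))
```

Note that count_iterwalk includes the cost of building the tree, which keeps the comparison with iterparse roughly like for like.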

tefra commented 3 years ago

Check #531. I think it's better to integrate this as a feature in the existing LxmlEventHandler.

tefra commented 3 years ago

Thanks for the suggestions @skinkie, keep them coming.

https://xsdata.readthedocs.io/en/latest/xml.html#parse-from-lxml-element-or-tree

tefra / xsdata

Deserialising a subtree from lxml to improve performance of huge documents #530