So if you know exactly what piece of the xml you actually need, parse it directly :love_you_gesture:
lxml has an API to turn an element or tree into an event stream.
from typing import Any
from typing import Union

from lxml import etree
from xsdata.formats.dataclass.parsers.handlers import LxmlEventHandler
from xsdata.models.enums import EventType

EVENTS = (EventType.START, EventType.END, EventType.START_NS)


class LxmlTreeWalkerHandler(LxmlEventHandler):
    __slots__ = ()

    def parse(self, source: Union[etree.Element, etree.ElementTree]) -> Any:
        """Parse directly an lxml element or tree."""
        ctx = etree.iterwalk(source, EVENTS)
        return self.process_context(ctx)
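from lxml import etree
from xsdata.formats.dataclass.parsers import XmlParser

# `sample`, `context`, `config` and `VersionsRelStructure` are defined elsewhere.
tree = etree.parse(sample)
parser = XmlParser(context=context, config=config, handler=LxmlTreeWalkerHandler)
versions = parser.parse(tree.find('.//{http://www.netex.org.uk/netex}versions'), VersionsRelStructure)
print(versions)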
VersionsRelStructure(id=None, modification_set=<ModificationSetEnumeration.ALL: 'all'>, version_ref=[], version=[Version(name_of_class_attribute=None, id='HTM:Version:_2020-10-12', validity_conditions=None, valid_between=[], alternative_texts=None, data_source_ref_attribute=None, created=None, changed=None, modification=<ModificationEnumeration.NEW: 'new'>, version='_2020-10-12', status_attribute=<StatusEnumeration.ACTIVE: 'active'>, derived_from_version_ref_attribute=None, compatible_with_version_frame_version_ref=None, derived_from_object_ref=None, key_list=None, extensions=None, branding_ref=None, responsibility_set_ref_attribute=None, start_date=XmlDateTime(2020, 10, 8, 0, 0, 0, 0, 0), end_date=XmlDateTime(2020, 10, 25, 0, 0, 0, 0, 0), status=None, description=None, version_type=<VersionTypeEnumeration.BASELINE: 'baseline'>, type_of_version_ref=None, derived_from_version_ref=None)])
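For anyone who wants to see what "walking a tree as an event stream" amounts to without lxml installed, here is a minimal sketch using only the standard library. It is an analogue, not lxml's implementation: `walk_events` and the tiny `XML` sample are made up for illustration, and `xml.etree.ElementTree` stands in for lxml. It walks an already-parsed element and yields the same start/end pairs that a streaming parse of the same bytes would produce.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def walk_events(element: ET.Element) -> Iterator[Tuple[str, ET.Element]]:
    """Yield ("start", el) and ("end", el) pairs for an already-parsed subtree."""
    yield "start", element
    for child in element:
        yield from walk_events(child)
    yield "end", element


# Hypothetical sample, loosely shaped like the element used in this thread.
XML = b"<versions><Version id='v1'><StartDate>2020-10-08</StartDate></Version></versions>"

# Events obtained by walking a tree that is already in memory.
root = ET.fromstring(XML)
walked = [(event, el.tag) for event, el in walk_events(root)]

# Events obtained by streaming the same bytes through the parser.
stream = ET.iterparse(io.BytesIO(XML), events=("start", "end"))
parsed = [(event, el.tag) for event, el in stream]

print(walked == parsed)  # True
```

Because both sources produce the same event sequence, a handler that consumes events does not need to care whether they come from a file or from an element that was already parsed.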
What bugs me a lot is that lxml's iterwalk seems faster than iterparse for the whole document.
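A rough way to check that impression is the sketch below. It is only a sketch: the synthetic document and the helper names (`build_sample`, `count_iterparse`, `count_iterwalk`) are made up for illustration, and it counts events without building any xsdata objects. The iterwalk variant includes the time for `etree.parse`, so both functions start from the same bytes. Timings depend on the machine and the document, so none are quoted here.

```python
import io
import time

from lxml import etree


def build_sample(n: int = 20_000) -> bytes:
    """Build a synthetic document (hypothetical data, not NeTEx)."""
    root = etree.Element("root")
    for i in range(n):
        etree.SubElement(root, "item", id=str(i)).text = "x"
    return etree.tostring(root)


def count_iterparse(data: bytes) -> int:
    """Stream the bytes through iterparse and count start/end events."""
    return sum(1 for _ in etree.iterparse(io.BytesIO(data), events=("start", "end")))


def count_iterwalk(data: bytes) -> int:
    """Parse into a tree first, then walk it and count start/end events."""
    tree = etree.parse(io.BytesIO(data))
    return sum(1 for _ in etree.iterwalk(tree, events=("start", "end")))


data = build_sample()
for func in (count_iterparse, count_iterwalk):
    started = time.perf_counter()
    events = func(data)
    elapsed = time.perf_counter() - started
    print(f"{func.__name__}: {events} events in {elapsed:.4f}s")
```

Both functions should report the same number of events. If iterwalk still comes out ahead with the parse time included, the difference is in how the events are produced rather than in the parsing itself.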
Check #531. I think it's better to integrate this as a feature in the existing LxmlEventHandler.
Thanks for the suggestions @skinkie, keep them coming!
https://xsdata.readthedocs.io/en/latest/xml.html#parse-from-lxml-element-or-tree
In #476 we established that the performance of huge XSDs in combination with larger XML is limited by single-threaded processing. I wonder if the example below could be executed more elegantly, by directly using the lxml Element without the need to serialise it to a string again. Hence, could LxmlSaxHandler or LxmlEventHandler be convinced to start with an Element?
What I would like to explore is below; it seems to me that this is already so close to the lxml stuff...