tefra / xsdata

Naive XML & JSON Bindings for python
https://xsdata.readthedocs.io
MIT License

XML parsing approximately 200 times slower than parsing to in-built xml.etree.ElementTree.Element #1057

Closed DareDevilDenis closed 2 months ago

DareDevilDenis commented 3 months ago

Using:

I'd like to ask about the performance of xsdata XML parsing. In my benchmarking I found it to be approximately 200 times slower than parsing to the in-built xml.etree.ElementTree.Element. I was expecting xsdata to be a little slower but this difference seems to be extreme. I tried both XmlEventHandler and LxmlEventHandler and got similar results.

Is this difference expected? If it's expected then I apologise for raising this as an issue.

Test script:

from pathlib import Path
import time
import xml.etree.ElementTree
import my_dataclass
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.handlers import XmlEventHandler

TEST_ITERATIONS = 2000
my_path = Path(__file__).parent
xml_file = my_path / "input.xml"

def main():
    with xml_file.open() as f:
        file_contents = f.read()

    start_time = time.time()
    using_in_built_element(file_contents, TEST_ITERATIONS)
    end_time = time.time()
    time_using_in_built_element = end_time - start_time
    print("Time using Python xml.etree.ElementTree.Element:", time_using_in_built_element)

    start_time = time.time()
    using_xsdata(file_contents, TEST_ITERATIONS)
    end_time = time.time()
    time_using_xsdata = end_time - start_time
    print("Time using xsdata:", time_using_xsdata)
    print("Ratio:", time_using_xsdata / time_using_in_built_element)

def using_in_built_element(xml_string, iterations):
    for _ in range(iterations):
        xml_root = xml.etree.ElementTree.fromstring(xml_string)

def using_xsdata(xml_string, iterations):
    parser = XmlParser(handler=XmlEventHandler)
    for _ in range(iterations):
        record_as_obj = parser.from_string(xml_string, my_dataclass.LogRecord)

if __name__ == "__main__":
    main()

My results:

Time using Python xml.etree.ElementTree.Element: 0.2820000648498535
Time using xsdata: 55.9834668636322
Ratio: 198.52288648741876

I've attached this script, "input.xml" and "my_dataclass.py": xsdata_xml_parse_performance.zip
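
A single start/stop measurement like the one above can be noisy, so here is a minimal sketch of the same comparison using `timeit.repeat` and taking the fastest repeat. `SAMPLE_XML`, `best_seconds_per_call` and `parse_and_copy_attributes` are hypothetical stand-ins (the real test uses "input.xml" and `XmlParser`), so the ratio it prints says nothing about xsdata itself.

```python
import timeit
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the attached "input.xml".
SAMPLE_XML = "<LogRecord><Field name='rnti' value='17'/></LogRecord>"


def best_seconds_per_call(func, number=200, repeat=5):
    """Return the fastest observed per-call time over several repeats."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number


def parse_with_etree():
    """Baseline: parse to xml.etree.ElementTree.Element only."""
    return ET.fromstring(SAMPLE_XML)


def parse_and_copy_attributes():
    """Hypothetical stand-in for a binding step; swap in the real parser call."""
    root = ET.fromstring(SAMPLE_XML)
    return [dict(child.attrib) for child in root]


baseline = best_seconds_per_call(parse_with_etree)
candidate = best_seconds_per_call(parse_and_copy_attributes)
print("Ratio:", candidate / baseline)
```

Taking the minimum of several repeats tends to filter out interference from other processes, which makes ratios easier to compare between runs.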

tefra commented 3 months ago

These numbers look awful 😞, but yeah, xsdata, like most pure-Python binding libraries, will always be slower.

In your case specifically it's way worse because of all the union fields. The parser will attempt to parse the XML node with every dataclass in the union. Then it will take the "successful" attempts and return the one with the highest score. This is a very crude process and unfortunately it's very slow.

    field_value: List[
        Union[
            TrtApiVersion,
            TfpgaApiVersion,
            Tuxmid,
            TcellSetId,
            TulCarrier,
            TcellEntityId,
            Tueid,
            TglobalTti,
            TglobalTtiToDecode,
            ExternalTtti,
            Tnumerology,
            TrntiType,
            Trnti,
            TsamplingFreq,
            TmeasurementState,
            TharqProcess,
            TsymbolsFreqHop1,
            TsymbolsFreqHop2,
            TmeanEvm,
            TmeanEvmPerLayer,
            TevmPerSymbol,
            ExternalTevm,
            TnumPuschOfdmDmrsSymbols,
            TdmrsOfdmSymbolIndex,
            ExternalTdmrsSto,
            TnumLayers,
            ExternalTdmrsStoPerLayer,
            ExternalTdmrsPower,
            ExternalTdmrsCorrelation,
            TnumAntennas,
            ExternalTpowerParameters,
            ExternalTdcLeakageMeasurement,
            TpowerSummary,
            TcrcFeedback,
            TappliedCfo,
            ExternalTdeltaCfo,
            TulTimingOffset,
            TphaseMeas,
            ExternalTphaseMeasurements,
        ]
    ] = field(
        default_factory=list,
        metadata={
            "name": "Field",
            "type": "Element",
            "min_occurs": 33,
            "max_occurs": 39,
        },
    )
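
To illustrate why a large union is expensive, here is a minimal sketch of the "try every candidate, then keep the highest score" idea described above. It is not xsdata's actual code: the field layouts of `Trnti` and `TmeanEvm`, the helpers `try_bind` and `bind_union`, and the scoring rule (number of fields matched by the input) are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Sequence, Type


# Hypothetical shapes; the real generated classes are in "my_dataclass.py".
@dataclass
class Trnti:
    name: str
    value: int


@dataclass
class TmeanEvm:
    name: str
    value: float
    unit: str = "percent"


def try_bind(cls: Type[Any], attrs: Dict[str, str]) -> Optional[Any]:
    """Try to build cls from raw string attributes; return None on failure."""
    try:
        kwargs = {
            f.name: f.type(attrs[f.name])  # convert "17" -> 17, "1.5" -> 1.5
            for f in fields(cls)
            if f.name in attrs
        }
        return cls(**kwargs)
    except (TypeError, ValueError):
        return None


def bind_union(candidates: Sequence[Type[Any]], attrs: Dict[str, str]) -> Optional[Any]:
    """Attempt every candidate and keep the successful one with the best score."""
    best, best_score = None, -1
    for cls in candidates:  # every candidate is tried; none is skipped
        obj = try_bind(cls, attrs)
        if obj is None:
            continue
        # Assumed scoring rule: count the fields the input actually supplied.
        score = sum(1 for f in fields(cls) if f.name in attrs)
        if score > best_score:
            best, best_score = obj, score
    return best


print(bind_union([Trnti, TmeanEvm], {"name": "evm", "value": "1.5", "unit": "dB"}))
```

With 39 classes in the union and 33 to 39 `Field` elements per record, a loop of this shape would run on the order of a thousand binding attempts per document, which is consistent with the slowdown reported here.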

tefra commented 3 months ago

Leave it open, I want to take a look with the given sample to see if there is anything we can do to improve the performance...

DareDevilDenis commented 3 months ago

Thanks @tefra. Please let me know if I can help with further testing.

skinkie commented 3 months ago

> In your case specifically it's way worse because of all the union fields. The parser will attempt to parse the XML node with every dataclass in the union. Then it will take the "successful" attempts and return the one with the highest score. This is a very crude process and unfortunately it's very slow.

Are you saying there is nothing like namespace + tag index?

tefra commented 2 months ago

Hi @DareDevilDenis, I added an optimization that selects the correct element earlier, based on fixed attributes. I'm not going to say the performance is now great, but according to my local tests this brings the ratio down from ~200 to ~34.
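
As a rough illustration of selecting a candidate early from a fixed attribute, the sketch below builds a lookup table once and then picks the class directly instead of trying every member of the union. This is only an assumption about the general idea, not the change that was made in xsdata: the `name` attribute, the class layouts, and the helpers `build_fixed_index` and `select_candidate` are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional, Sequence, Type


# Hypothetical classes whose "name" attribute has a fixed value.
@dataclass
class Trnti:
    value: int
    name: str = field(init=False, default="rnti")


@dataclass
class TmeanEvm:
    value: float
    name: str = field(init=False, default="meanEvm")


def build_fixed_index(candidates: Sequence[Type[Any]], attr: str = "name") -> Dict[str, Type[Any]]:
    """Map each candidate's fixed attribute value to the candidate class."""
    index: Dict[str, Type[Any]] = {}
    for cls in candidates:
        for f in fields(cls):
            # Treat a non-init field with a default as a "fixed" attribute.
            if f.name == attr and not f.init:
                index[f.default] = cls
    return index


def select_candidate(index: Dict[str, Type[Any]], attrs: Dict[str, str], attr: str = "name") -> Optional[Type[Any]]:
    """Return the class whose fixed value matches the node, or None."""
    return index.get(attrs.get(attr))


index = build_fixed_index([Trnti, TmeanEvm])
print(select_candidate(index, {"name": "meanEvm", "value": "1.5"}))
```

When the lookup returns `None`, a parser could still fall back to the slower try-everything path, so the index only needs to cover the common case.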