I would like to write huge trees but don't retain the entire tree in memory

tefra / xsdata

Naive XML & JSON Bindings for python

https://xsdata.readthedocs.io

MIT License


I would like to write huge trees but don't retain the entire tree in memory #1031

Open skinkie opened 5 months ago

skinkie commented 5 months ago

Ideally, I would like to write out a tree where the data is added just in time. The proposal in #1030 shows increasing memory usage, which suggests that the tree is still being built completely in memory. I wanted to add some evidence. Please ignore the timing.

Using the generator method: mem-graph-generator

Materializing into a list first: mem-graph

Ideally, memory consumption wouldn't increase at all, and the data would simply be written out as it is provided. But I guess the graphs give a clear view of where we can make some improvements when writing out huge documents.
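To make "written out as it is provided" concrete, here is a minimal sketch that uses only the standard library's `xml.sax.saxutils.XMLGenerator`. It is not xsdata code: `generate_rows` and `stream_rows` are hypothetical names invented for illustration, and the element names are made up. Each row is written the moment the generator yields it, so nothing but the current row has to be held by the writer.

```python
import io
from typing import Iterable, Iterator, TextIO
from xml.sax.saxutils import XMLGenerator


def generate_rows(count: int) -> Iterator[dict]:
    """Hypothetical data source: yields one record at a time."""
    for i in range(count):
        yield {"id": str(i), "name": f"row-{i}"}


def stream_rows(out: TextIO, rows: Iterable[dict]) -> int:
    """Write each row as soon as it is produced; return the number written."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("rows", {})
    written = 0
    for row in rows:
        # The start tag, text and end tag go to `out` immediately.
        writer.startElement("row", {"id": row["id"]})
        writer.characters(row["name"])
        writer.endElement("row")
        written += 1
    writer.endElement("rows")
    writer.endDocument()
    return written


# Small demo against an in-memory buffer; a real run would pass an open file.
buffer = io.StringIO()
count = stream_rows(buffer, generate_rows(3))
xml_text = buffer.getvalue()
```

With a real file object in place of `io.StringIO`, the output grows on disk while the process only ever holds one row.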

tefra commented 5 months ago

We need to fully support the Iterable type annotation for infinite generators in the data models, and the serializers.

The PR is a good first attempt, @skinkie, but it needs some more work.
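As a rough illustration of what supporting `Iterable` in a model could mean, the sketch below uses plain dataclasses. Everything in it is hypothetical (`Item`, `Container`, `numbered_items`, `iter_events`) and none of it is xsdata's actual implementation. The point is only that a field annotated as `Iterable` can hold a generator, and that a serializer walking it lazily pulls one item at a time.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple

# Records which items the generator has produced so far (for the demo only).
produced: List[int] = []


@dataclass
class Item:
    value: int


@dataclass
class Container:
    # Iterable rather than List, so a generator is a valid value.
    items: Iterable[Item] = field(default_factory=list)


def numbered_items(count: int) -> Iterator[Item]:
    """Hypothetical lazy source of items."""
    for i in range(count):
        produced.append(i)
        yield Item(value=i)


def iter_events(obj: Container) -> Iterator[Tuple[str, str]]:
    """Yield (event, payload) pairs lazily, without materializing obj.items."""
    yield ("start", "Container")
    for item in obj.items:
        yield ("start", "Item")
        yield ("data", str(item.value))
        yield ("end", "Item")
    yield ("end", "Container")


container = Container(items=numbered_items(2))
events = iter_events(container)  # nothing has been consumed yet
```

A writer that consumes `events` one by one never needs the full list of items. The trade-off is that a generator-backed field can only be serialized once.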

skinkie commented 4 months ago

Writing a 3.4GB file using generators takes ~12GB of memory with LxmlEventWriter. XmlEventWriter takes virtually no memory while writing to disk, and it writes in a streaming fashion. I think this must be investigated, especially if LxmlEventWriter is the default. I rewrote my whole project to split things up because I was under the impression that I couldn't fit the data in memory.
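The kind of gap described here can be reproduced in miniature with the standard library alone. The sketch below is an analogy, not a measurement of xsdata or lxml: `xml.etree.ElementTree` stands in for any writer that builds the whole tree first, `XMLGenerator` stands in for an event-based streaming writer, and `CountingSink`, `records`, and `peak_bytes` are hypothetical helpers. `tracemalloc` only tracks Python-level allocations, so the absolute numbers are indicative at best.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator


class CountingSink(io.TextIOBase):
    """Text sink that discards output and only counts characters."""

    def __init__(self):
        super().__init__()
        self.size = 0

    def writable(self):
        return True

    def write(self, text):
        self.size += len(text)
        return len(text)


def records(count):
    """Hypothetical lazy data source."""
    for i in range(count):
        yield f"record-{i}"


def write_via_tree(out, items):
    # Builds every element in memory before anything is written.
    root = ET.Element("root")
    for text in items:
        ET.SubElement(root, "item").text = text
    ET.ElementTree(root).write(out, encoding="unicode")


def write_via_events(out, items):
    # Emits each element as soon as the generator yields it.
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("root", {})
    for text in items:
        gen.startElement("item", {})
        gen.characters(text)
        gen.endElement("item")
    gen.endElement("root")
    gen.endDocument()


def peak_bytes(writer, count):
    """Return the peak traced memory (in bytes) while running `writer`."""
    sink = CountingSink()
    tracemalloc.start()
    writer(sink, records(count))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


tree_peak = peak_bytes(write_via_tree, 20_000)
event_peak = peak_bytes(write_via_events, 20_000)
print(f"tree: {tree_peak} bytes, events: {event_peak} bytes")
```

The tree-based peak grows with the number of records, while the event-based peak should stay roughly flat, which matches the shape of the behaviour reported above.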

tefra commented 4 months ago

It's mentioned in a few places in the docs

https://xsdata.readthedocs.io/en/latest/data_binding/xml_serializing/#alternative-writers
https://xsdata.readthedocs.io/en/latest/api/formats/dataclass/serializers/writers/lxml/#xsdata.formats.dataclass.serializers.writers.lxml.LxmlEventWriter
https://xsdata.readthedocs.io/en/latest/api/formats/dataclass/serializers/writers/native/#xsdata.formats.dataclass.serializers.writers.native.XmlEventWriter

For normal use cases, the lxml writer is always faster; a 3.4GB XML file is not very common 😄

skinkie commented 4 months ago

@tefra the docs mention that there are alternatives, but not the characteristics of the two.