Closed: jofo-ivbar closed this issue 4 years ago.
Decoding without loading the entire XML in memory has been planned (yes, using lazy=True), but it needs a different base iter_decode method for the schema class (e.g. iter_lazy_decode) that operates on the XML's first-level nodes. My current priority is to complete XSD 1.1, but I'm open to other contributions via a PR.
There are also other things to change in order to use a new lazy method from the other decode methods (e.g. to_json()), but these could be planned and written later.
A new release is available (v1.0.13) that supports lazy XML resource validation and decoding.
For iter_decode you have to pass as the source an XMLResource instance created with lazy=True.
A lazy XML resource uses ElementTree's iterparse method with a path (e.g. '*' for decoding all the root's children in sequence and deleting the decoded ones). More work is needed to do this in a transparent way (currently only validation works in lazy mode on a document), but a test with your data may be useful to see how it works.
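For example, a minimal sketch of this usage (the schema and document file names are hypothetical, and the processing is just a placeholder):

    import xmlschema

    schema = xmlschema.XMLSchema('collection.xsd')                 # hypothetical schema
    resource = xmlschema.XMLResource('collection.xml', lazy=True)  # hypothetical document

    # Decode the root's children one at a time; with a lazy resource the
    # decoded subtrees are discarded as the iteration advances.
    for obj in schema.iter_decode(resource, path='*'):
        print(obj)  # replace with your own processing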
Great, thanks! I will try to have a look and see if it helps my use case.
Hi, have you already tested the lazy validation mode? If so, have you noticed a significant reduction in memory consumption? For the next release (v1.0.15) I want to add proper memory tests for this, but I'm curious to know whether the current solution seems to be the right way or is pretty useless. Thanks
Hi again, sorry for being slow.
I did a little testing with the same data set as before, using the lazy flag as described, and indeed the memory usage went down by a lot! But, on the other hand, the decoding became way slower, by a factor of around 10-100. Is this trade-off expected or is something else going on?
Hi, that's very good news for the memory saving! The slowdown is probably due to XPath selection and is somewhat expected ... I'll try to speed it up with a cache.
Hi, @jofo-ivbar. You can try the new release v1.0.15 and check whether in lazy mode the speed is better than in the previous release. I've made some improvements to the XPath bindings that should make XPath processing faster.
Nope, v1.0.15 has the same problem. Lighter XPath bindings do not improve performance.
The answer is in the XMLResource.iterfind method, related to the path selection that is applied to each element node of the document. This is very heavy and cannot benefit from any caching optimization.
So in the next release I will add some heuristics and limitations on the path in order to minimize the differences in speed.
The results of a test with a medium-sized file (about 50 MB ...) are encouraging (the second run is the test in lazy mode):
$ time python xmlschema/tests/check_memory.py 7 test_cases/eduGAIN/edugain-v1.xml
Filename: xmlschema/tests/check_memory.py
Line # Mem usage Increment Line Contents
================================================
106 33.0 MiB 33.0 MiB @profile
107 def validate(source):
108 33.0 MiB 0.0 MiB validator = xmlschema.XMLSchema.meta_schema if source.endswith('.xsd') else xmlschema
109 143.4 MiB 110.4 MiB return validator.validate(source)
real 3m56.668s
user 3m55.619s
sys 0m0.284s
$ time python xmlschema/tests/check_memory.py 8 test_cases/eduGAIN/edugain-v1.xml
Filename: xmlschema/tests/check_memory.py
Line # Mem usage Increment Line Contents
================================================
112 33.0 MiB 33.0 MiB @profile
113 def lazy_validate(source):
114 33.0 MiB 0.0 MiB if source.endswith('.xsd'):
115 validator, path = xmlschema.XMLSchema.meta_schema, '*'
116 else:
117 33.0 MiB 0.0 MiB validator, path = xmlschema, None
118 36.0 MiB 3.0 MiB return validator.validate(xmlschema.XMLResource(source, lazy=True), path=path)
real 5m7.022s
user 5m6.012s
sys 0m0.086s
You were too fast for me, but I can at least confirm that 1.0.15 does not significantly improve performance over the earlier version in my test either. The benchmark looks promising; I think it's usually OK to trade off a bit of performance for a large memory saving.
Speed should be better now with version v1.0.16. Check it out when you can.
Would it be possible to incorporate the features from my repo? https://github.com/davlee1972/xml_to_json
I'm basically specifying an XPath, parsing each XPath match into its own JSON object, and then writing that JSON object as a single line to an output file.
The format is officially JSONL (line-delimited JSON).
Line-delimited JSON has many advantages over regular JSON and is used by data science tools like Spark, Python, etc., since JSONL files are splittable, so you can have multiple processes working on a single file, i.e. lines 1 to 100 vs lines 101 to 200.
There is also support for reading a zip archive of XML files or a gzip-compressed XML file, as well as outputting a gzipped JSON/JSONL file, which helps when dealing with large XML files.
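For reference, a minimal sketch of that streaming pattern built on xmlschema's lazy decoding (this is not the xml_to_json tool itself, and the file names are hypothetical):

    import json

    import xmlschema
    from xmlschema import XMLSchemaValidationError

    schema = xmlschema.XMLSchema('collection.xsd')
    resource = xmlschema.XMLResource('collection.xml', lazy=True)

    # Write one JSON object per line (JSONL), so the output can be split and
    # processed in parallel by line ranges.
    with open('collection.jsonl', 'w') as fp:
        for obj in schema.iter_decode(resource, path='*', validation='lax'):
            if isinstance(obj, XMLSchemaValidationError):
                continue  # with 'lax' validation, errors are yielded too
            fp.write(json.dumps(obj) + '\n')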
Hi @davlee1972 ,
I took a look at your code some months ago during the rewrite of XMLResource to include an effective lazy mode.
The JSONL format is very interesting. I'll examine whether it can be included in the library (maybe a couple of APIs at document/package level), then I'll give you an answer. Bye
iterparse is good for memory savings when consuming large XML files, but you also run into memory issues when the resulting dictionary from to_dict() has a large memory footprint.
If you can stream the output from iterparse to a JSON or JSONL file, you remove the memory requirements for the dictionary.
Yes, iterparse is not enough to save memory in JSON encoding (as in the current implementation of xmlschema.to_json()).
As you can see in the schema.py code, I have split iter_errors() (validation) from iter_decode() in order to optimize validation first and then optimize decoding. Maybe the right idea could be to implement a to_json() at schema level or to add other serialization options to iter_encode(). I will try some experiments with the JSONEncoder class.
Hi @davlee1972, the experiments with the JSONEncoder class are going well. It seems feasible to create lazy JSON encoders (XML --> JSON) using a custom encoder class like:
    import json
    from collections.abc import Iterator

    from xmlschema import XMLSchemaValidationError

    errors = []  # validation errors collected during serialization

    class JSONLazyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, Iterator):
                # Pull items from the decode generator: keep validation
                # errors aside and return the decoded value for serialization.
                while True:
                    result = next(obj, None)
                    if isinstance(result, XMLSchemaValidationError):
                        errors.append(result)
                    else:
                        return result
            return json.JSONEncoder.default(self, obj)
With two lazy scans (iterparse) of the XML file it's possible to create a smaller object filled with generator instances (using the max_depth argument and a new depth filler argument), e.g.:
{'@xmlns:col': 'http://example.com/ns/collection', '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', '@xsi:schemaLocation': 'http://example.com/ns/collection collection.xsd', 'object': [<generator object XMLSchemaBase._raw_iter_decode at 0x7ffa0eab33d0>, <generator object XMLSchemaBase._raw_iter_decode at 0x7ffa0eab33d0>]}
And then dump this to json using the custom encoder.
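A hypothetical usage of that encoder, where lazy_obj stands for the partially decoded object shown above:

    # Dump the object whose deeper nodes are still generators; the custom
    # encoder consumes them during serialization, so the fully decoded
    # dictionary never has to exist in memory at once.
    with open('collection.json', 'w') as fp:
        json.dump(lazy_obj, fp, cls=JSONLazyEncoder)

    # Any validation errors collected by the encoder can be inspected afterwards.
    for err in errors:
        print(err)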
I'm going to add this in v1.1.0 as a new experimental feature.
I've been trying the library out and it looks very useful!
However I have one issue, regarding decoding of large files. The behavior right now appears to be to read the entire file into memory immediately and then decode it. This happens even when using the iter_decode() method, which is a bit unexpected from such a method IMHO. E.g., when decoding a 150 MB XML file, the Python process takes a number of seconds and consumes over 2 GB of memory before starting to emit any items, so I guess all the parsing is done up front. For huge files that consist of a large number of similar records, it would be really nice to be able to iterate through them without keeping the whole thing in memory.
I looked into it a little bit and it seemed like it would be possible to change this by changing lazy=False to lazy=True on the XMLResource being created. This does produce a few records as expected, but then it breaks, seemingly because only the first 16 KB were read from the file. It looks like this number may come from some buffer size within lxml, but I didn't look further into this.
Are there fundamental reasons why this does not work? I suppose it's not really possible to validate against the schema without reading the entire file, but I think there are use cases where it's OK not to validate the entire file first.
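For comparison, here is a minimal stdlib-only sketch (independent of xmlschema, with a hypothetical tag and file name) of the kind of record-by-record iteration described above:

    import xml.etree.ElementTree as ElementTree

    def iter_records(source, tag):
        # Assumes the records of interest are direct children of the root.
        root = None
        for event, elem in ElementTree.iterparse(source, events=('start', 'end')):
            if root is None:
                root = elem  # the first 'start' event delivers the document root
            elif event == 'end' and elem.tag == tag:
                yield elem
                root.remove(elem)  # drop the processed record, keeping memory flat

    # Hypothetical usage:
    # for record in iter_records('huge.xml', '{http://example.com/ns}record'):
    #     handle(record)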