sissaschool / xmlschema

XML Schema validator and data conversion library for Python
MIT License

Decoding without loading the whole file #102

Closed jofo-ivbar closed 4 years ago

jofo-ivbar commented 5 years ago

I've been trying the library out and it looks very useful!

However I have one issue, regarding decoding of large files. The behavior right now appears to be to read the entire file into memory immediately and then decode it. This happens even when using the iter_decode() method, which is a bit unexpected from such a method IMHO. E.g. when decoding a 150 MB XML file, the Python process takes a number of seconds and consumes over 2 GB of memory before starting to emit any items, so I guess all the parsing is done up front. For huge files that consist of a large number of similar records it would be really nice to be able to iterate through them without keeping the whole thing in memory.

I looked into it a little bit and it seemed like it would be possible to change this by switching the default lazy=False to lazy=True on the XMLResource being created. This does produce a few records as expected, but then it breaks, seemingly because only the first 16 KB were read from the file. It looks like this number may come from some buffer size within lxml, but I didn't look further into this.

Are there fundamental reasons why this does not work? I suppose it's not really possible to fully validate against the schema without reading the entire file, but I think there are use cases where it's OK not to validate the entire file first.

brunato commented 5 years ago

Decoding without loading the entire XML into memory has been planned (yes, using lazy=True), but it needs a different base iter_decode method for the schema class (e.g. iter_lazy_decode) that operates on the XML first-level nodes. My current priority is to complete XSD 1.1 support, but I'm open to contributions via a PR.

There are also other things to change in order to use a new lazy method from the other decoding APIs (e.g. to_json()), but these could be planned and written afterwards.

brunato commented 5 years ago

A new release (v1.0.13) is available that supports lazy XML resource validation and decoding. For iter_decode you have to pass, as the source, an XMLResource instance created with lazy=True. A lazy XML resource uses ElementTree's iterparse method together with a path (e.g. '*' to decode all the root's children in sequence, deleting the already-decoded ones). More work is needed to do this in a transparent way (currently only validation works with lazy mode at document level), but a test with your data may be useful to see how it works.
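As a minimal sketch of that usage (the schema and XML file names are just placeholders, not taken from this issue):

import xmlschema

schema = xmlschema.XMLSchema('collection.xsd')
resource = xmlschema.XMLResource('big-file.xml', lazy=True)

# With path='*' each first-level child of the root is decoded in turn,
# instead of building the decoded structure for the whole document at once.
for obj in schema.iter_decode(resource, path='*'):
    print(obj)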

jofo-ivbar commented 5 years ago

Great, thanks! I will try to have a look and see if it helps my use case.

brunato commented 5 years ago

Hi, have you already tested the lazy validation mode? If so, have you noticed a significant reduction in memory consumption? For the next release (v1.0.15) I want to add proper memory tests for this, but I'm curious to know whether the current solution seems to be the right way or is pretty useless. Thanks

jofo-ivbar commented 5 years ago

Hi again, sorry for being slow.

I did a little testing with the same data set as before, using the lazy flag as described, and indeed the memory usage went down by a lot! But, on the other hand, the decoding became way slower, by a factor of around 10-100. Is this trade-off expected or is something else going on?

brunato commented 5 years ago

Hi, great news about the memory saving! The slowdown is probably due to the XPath selection and is somewhat expected... I'll try to speed it up with a cache.

brunato commented 5 years ago

Hi, @jofo-ivbar. You can try the new release v1.0.15 and check whether in lazy mode the speed is not as bad as in the previous release. I've made some improvements to the XPath bindings that should make XPath processing faster.

brunato commented 5 years ago

Nope, v1.0.15 has the same problem. Lighter XPath bindings do not improve performance. The culprit is the XMLResource.iterfind method, where the path selection is applied to each element node of the document. This is very heavy and cannot benefit from any caching optimization. So in the next release I will add some heuristics and limitations on paths in order to minimize the difference in speed.

The results of a test with a medium-sized file (about 50 MB ...) are encouraging (the second run is the test in lazy mode):

$ time python xmlschema/tests/check_memory.py 7 test_cases/eduGAIN/edugain-v1.xml
Filename: xmlschema/tests/check_memory.py

Line #    Mem usage    Increment   Line Contents
================================================
   106     33.0 MiB     33.0 MiB   @profile
   107                             def validate(source):
   108     33.0 MiB      0.0 MiB       validator = xmlschema.XMLSchema.meta_schema if source.endswith('.xsd') else xmlschema
   109    143.4 MiB    110.4 MiB       return validator.validate(source)

real    3m56.668s
user    3m55.619s
sys 0m0.284s

$ time python xmlschema/tests/check_memory.py 8 test_cases/eduGAIN/edugain-v1.xml
Filename: xmlschema/tests/check_memory.py

Line #    Mem usage    Increment   Line Contents
================================================
   112     33.0 MiB     33.0 MiB   @profile
   113                             def lazy_validate(source):
   114     33.0 MiB      0.0 MiB       if source.endswith('.xsd'):
   115                                     validator, path = xmlschema.XMLSchema.meta_schema, '*'
   116                                 else:
   117     33.0 MiB      0.0 MiB           validator, path = xmlschema, None
   118     36.0 MiB      3.0 MiB       return validator.validate(xmlschema.XMLResource(source, lazy=True), path=path)

real    5m7.022s
user    5m6.012s
sys 0m0.086s

jofo-ivbar commented 5 years ago

You were too fast for me, but I can at least confirm that 1.0.15 does not significantly improve performance over the earlier version in my test either. The benchmark looks promising; I think it's usually OK to trade off a bit of performance for a large memory saving.

brunato commented 5 years ago

The speed should be better now with version v1.0.16. Check it when you can.

davlee1972 commented 4 years ago

Would it be possible to incorporate the features from my repo? https://github.com/davlee1972/xml_to_json

I'm basically specifying an XPath, parsing each XPath match into its own JSON object, and then writing that JSON object as a single line to an output file.

The format is officially JSONL (line-delimited JSON).

Line-delimited JSON has many advantages over regular JSON and is used by data science tools like Spark, Python, etc., since JSONL files are splittable, so you can have multiple processes working on a single file, e.g. lines 1 to 100 vs. lines 101 to 200.

There is also support for reading a zip archive of XML files or a gzip-compressed XML file, as well as outputting a gzipped JSON/JSONL file, which helps when dealing with large XML files.
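As a rough sketch of the JSONL idea (not the actual xml_to_json code; the function and file names are illustrative), each decoded record becomes one JSON document on its own line, optionally written through gzip:

import gzip
import json

def write_jsonl(records, path):
    # Write one JSON document per line; gzip the output if the path ends in .gz.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'wt', encoding='utf-8') as fp:
        for record in records:
            fp.write(json.dumps(record) + '\n')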

brunato commented 4 years ago

Hi @davlee1972,

I took a look at your code some months ago during the rewriting of XMLResource to include an effective lazy mode.

The JSONL format is very interesting. I'll look into whether it can be included in the library (maybe a couple of APIs at document/package level), then I'll give you an answer. Bye

davlee1972 commented 4 years ago

iterparse is good for memory savings when consuming large XML files, but you also run into memory issues when the dictionary produced by to_dict() has a large memory footprint.

If you can stream the output from iterparse to a JSON or JSONL file, you remove the memory requirement for the dictionary.

https://jsonlines.readthedocs.io/en/latest/
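A hedged sketch of what such streaming could look like with xmlschema's lazy decoding (the schema, file names and path are placeholders): each decoded object is serialized and written immediately, so the full decoded dictionary never has to exist in memory.

import json
import xmlschema

schema = xmlschema.XMLSchema('records.xsd')
resource = xmlschema.XMLResource('big-file.xml', lazy=True)

with open('records.jsonl', 'w', encoding='utf-8') as fp:
    for obj in schema.iter_decode(resource, path='*', validation='lax'):
        if isinstance(obj, xmlschema.XMLSchemaValidationError):
            continue  # in 'lax' mode validation errors are yielded interleaved with data
        fp.write(json.dumps(obj) + '\n')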

brunato commented 4 years ago

Yes, iterparse is not enough to save memory in JSON encoding (as it stands in the current implementation of xmlschema.to_json()).

As you can see in the schema.py code, I have split iter_errors() (validation) from iter_decode() in order to optimize validation first and then decoding. Maybe the right approach could be to implement a to_json() at schema level or to add other serialization options to iter_encode(). I will try some experiments with the JSONEncoder class.

brunato commented 4 years ago

Hi @davlee1972, the experiments with the JSONEncoder class are going well. It seems feasible to create lazy JSON encoders (XML --> JSON) using a custom encoder class like:

import json
from collections.abc import Iterator

from xmlschema import XMLSchemaValidationError

# Collects the validation errors emitted by the decoding generators.
errors = []

class JSONLazyEncoder(json.JSONEncoder):
    def default(self, obj):
        # Consume decoding generators lazily during serialization,
        # setting validation errors aside instead of serializing them.
        if isinstance(obj, Iterator):
            while True:
                result = next(obj, None)
                if isinstance(result, XMLSchemaValidationError):
                    errors.append(result)
                else:
                    return result
        return json.JSONEncoder.default(self, obj)

With two lazy scans (iterparse) of the XML file it's possible to create a smaller object filled with generator instances (using the max_depth argument and a new depth-filler argument), e.g.:

{'@xmlns:col': 'http://example.com/ns/collection', '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', '@xsi:schemaLocation': 'http://example.com/ns/collection collection.xsd', 'object': [<generator object XMLSchemaBase._raw_iter_decode at 0x7ffa0eab33d0>, <generator object XMLSchemaBase._raw_iter_decode at 0x7ffa0eab33d0>]}

And then dump this to JSON using the custom encoder.
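The dump itself would then be a standard json.dump() call with the custom encoder class (a hypothetical usage sketch, where obj is the generator-filled mapping shown above):

with open('output.json', 'w') as fp:
    json.dump(obj, fp, cls=JSONLazyEncoder)

# Validation errors set aside by the encoder can be inspected afterwards.
if errors:
    print('%d validation errors collected during encoding' % len(errors))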

I'm going to add this in v1.1.0 as a new experimental feature.