pabigot / pyxb

Python XML Schema Bindings
Apache License 2.0

Performance issue when parsing large base64Binary data #50

Closed J20S closed 8 years ago

J20S commented 8 years ago

Hello, I have experienced performance issues when uploading large files. For example, say we have the following schema:

```xml
<xsd:complexType name="uploadFileRequest">
  <xsd:sequence>
    <xsd:element name="file" type="xsd:base64Binary" minOccurs="1" maxOccurs="1"/>
  </xsd:sequence>
</xsd:complexType>
```

When the file is large, say 7 MB, I notice a significant slowdown. I have traced the problem to the `CreateFromDocument()` function generated by PyXB, which is used to "parse the given XML and use the document element to create a Python instance".

More specifically, it is the following line in the above function that takes the majority of the time to execute:

```python
saxer.parse(io.BytesIO(xmld))
```

where `xmld` is the XML string passed into the function.

I posted this issue on SourceForge; thanks @pabigot for pointing out that it is the regex match that costs the majority of the time:

```python
# This is what it costs to try to be a validating processor.
if cls.__Lexical_re.match(xmlt) is None:
    raise SimpleTypeValueError(cls, xmlt)
```

If we comment this code block out, the issue goes away. However, since the check exists because "as PyXB is a validating processor it must check whether the incoming encoded data is a valid XML representation" (Peter), it would be good to have a workaround for this in future releases.
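To get a feel for where the time goes, here is a self-contained sketch. The pattern below is my simplified approximation of PyXB's `__Lexical_re`, not the actual one (the real pattern implements the full XML Schema production, including optional whitespace); the point is that a whole-string regex scan over a multi-hundred-kilobyte literal is the dominant cost:

```python
import base64
import os
import re

# Simplified stand-in for PyXB's __Lexical_re; the real pattern follows
# XML Schema Datatypes section 3.2.16 and also allows embedded whitespace.
_B64 = r'[A-Za-z0-9+/]'
_B16 = r'[AEIMQUYcgkosw048]'   # chars legal immediately before a single '='
_B04 = r'[AQgw]'               # chars legal immediately before '=='
LEXICAL_RE = re.compile(
    '^(?:{b64}{{4}})*'
    '(?:{b64}{{2}}{b16}=|{b64}{b04}==)?$'.format(b64=_B64, b16=_B16, b04=_B04))

# A ~400 KB canonical base64 literal, standing in for a large file upload.
encoded = base64.b64encode(os.urandom(3 * 100_000)).decode('ascii')

# This whole-string match is the expensive step the issue is about.
assert LEXICAL_RE.match(encoded) is not None
```

Note the character classes: the pattern rejects `ZZZ=` (the third `Z` is not a legal character before a single `=`) while accepting the canonical `ZQ==`.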

Thanks a lot for your help! @pabigot

Cheers, James

pabigot commented 8 years ago

The check exists because:

```python
# base64 is too lenient: it accepts 'ZZZ=' as an encoding of 'e', while
# the required XML Schema production requires 'ZQ=='.  Define a regular
# expression per section 3.2.16.
```

The proposed workaround is to add an API that allows the user to specify a maximum size for base64 literals that will be validated against the XML Schema requirement disallowing non-canonical encodings. Setting this to zero would disable the extra check; setting it to (say) 64 would keep the check for small values while avoiding it for file uploads.

The default would be None, meaning the validation would always be performed; applications that use large files would have to disable the check intentionally.
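A hedged sketch of how such a knob could behave (the names and details here are mine, not PyXB's actual implementation):

```python
import re

# Module-level limit: None = always validate, 0 = never validate,
# N > 0 = validate only literals of length <= N.
_validate_limit = None

def set_validate_length(limit):
    global _validate_limit
    _validate_limit = limit

# Simplified whole-string production; PyXB's real regex follows
# XML Schema Datatypes section 3.2.16.
_LEX_RE = re.compile(
    r'^(?:[A-Za-z0-9+/]{4})*'
    r'(?:[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=|[A-Za-z0-9+/][AQgw]==)?$')

def check_base64(xmlt):
    if _validate_limit is not None and len(xmlt) > _validate_limit:
        return True  # too large: skip the expensive lexical validation
    return _LEX_RE.match(xmlt) is not None
```

With the limit at 64, a small malformed literal like `'ZZZ='` is still rejected, while a multi-megabyte upload bypasses the scan entirely.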

This should be in the next release, whenever that happens.

J20S commented 8 years ago

Hi Peter,

Thanks for providing this feature in the new release!

I expect the usage of this feature is that we manually add something like `pyxb.binding.datatypes.base64Binary.XsdValidateLength(-1)` to the binding file generated by the pyxbgen command to disable the validation.

I would really like to automate that step. Short of writing extra scripts for it, is there any chance we can configure this via a command-line option?

I understand this might be a separate feature request, but if there is an existing workaround, that would be awesome!

Cheers, James

pabigot commented 8 years ago

The setting doesn't go into the binding file; it's a configuration change that affects validation globally, so just disable the validation once in the application that uses the bindings. If you have base64Binary values that you still want fully validated, you'll need to set and clear the limit in the application depending on whether the specific document is likely to be affected. There is no way to limit the validation to specific elements or namespaces.
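The set-and-clear pattern described above can be wrapped in a context manager. The getter/setter pair below is a stand-in I made up for illustration; in a real application you would toggle PyXB's `pyxb.binding.datatypes.base64Binary.XsdValidateLength(...)` (mentioned earlier in this thread) instead:

```python
from contextlib import contextmanager

# Stand-in for the global validation limit; in PyXB this state lives on
# the base64Binary class itself, not in a module global like this.
_limit = None

def get_validate_length():
    return _limit

def set_validate_length(value):
    global _limit
    _limit = value

@contextmanager
def b64_validation_disabled():
    """Disable base64Binary lexical validation for the duration of the
    with-block, then restore the previous setting."""
    previous = get_validate_length()
    set_validate_length(-1)   # per this thread, a negative value disables
    try:
        yield
    finally:
        set_validate_length(previous)
```

Parsing a document known to carry a large upload inside `with b64_validation_disabled():` keeps full validation in effect everywhere else.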

J20S commented 8 years ago

Thanks Peter, I got it now!