pplu / aws-sdk-perl

A community AWS SDK for Perl Programmers
Other
170 stars 94 forks source link

Switch from XML::Simple to XML::LibXML or XML::Fast #66

Open pplu opened 8 years ago

pplu commented 8 years ago

XML::Simple is discouraged, although it works for us. Maybe a faster and better XML parser would be nice

Grinnz commented 7 years ago

While XML::Fast may be better than XML::Simple (I don't know if this is the case), it doesn't solve the fundamental problems that make XML::Simple a bad idea -- having to deal with endless edge cases of items that could be arrays or single values, and tags that have both attributes and contents... this will be a problem in any XML-to-hash converter.

Using XML::LibXML or something like Mojo::DOM58 or another DOM traversal parser would solve this issue, but would be a much trickier conversion, you have to change how you approach the data.

pplu commented 7 years ago

+1: XML as a datasctructure is a pain :(. Be sure some ugly code is laying around because of that fact.

If you have any suggestions of how you would use XML::LibXML or Mojo::DOM58 in Paws, I'd love to see them.

With the branch I've just pushed I was playing around to see if a quick win is possible (I haven't benchmarked it yet, but it feels faster). Responses from S3 ListObjects with lots of elements are slow with XML::Simple, and seem to be faster with XML::Fast. The bad part is XML::Fast has not been maintained since 2010, so maybe it's not very dependable as a Paws core dep.

Grinnz commented 7 years ago

I will take a look when I get a chance at how a DOM object might be used, but I don't understand enough of the code using XML responses yet.

pplu commented 7 years ago

I've just found out about https://metacpan.org/pod/XML::Compile. It could be an option.

zostay commented 6 years ago

I recommend leaving XML::Simple for security reasons in addition to performance.

There is a significant XXE vulnerability that was reported 4 years ago in RT for XML::Simple. There continues to be no fix or work around that I am aware of. Without doing a code review of every use of XML::Simple to parse XML data in Paws, I'm not exactly sure how the existing vulnerability might be exploited. Basically it works like this, though, given an XML file that has a particular kind of doctype entity tag, XML::Simple will read any arbitrary file from the local disk and embed it into the parsed XML document tree. If the program using that document tree at all passes that data on, it can get logged or otherwise passed through to something else to leak information like password files.

XML::LibXML::Simple can usually be used as a drop-in replacement for parsing XML data while XML::Simple can still be used for writing XML safely (I don't believe XML::LibXML::Simple can generate XML).

pplu commented 6 years ago

@zostay Thanks for the heads-up about XML::LibXML::Simple (didn't know it existed, and was drop-in for XML::Simple)

pplu commented 6 years ago

Hi, I just tried out XML::LibXML::Simple on this branch. It fails tests because of differences in how it converts XML to a datastructure. https://github.com/pplu/aws-sdk-perl/tree/feature/xml-libxml-simple

perl -I lib/ -I auto-lib/ t/10_responses.t t/10_responses/s3-list-objects.1.response
[...]
$VAR1 = {
          'MaxKeys' => '1000',
          'Marker' => {},
          'Prefix' => {},

The problem seems to be that Marker, Prefix, etc, etc are empty HashRefs. The XML::LibXML::Simple documentation states that one of the differences with XML::Simple is that empty elements are not removed. https://metacpan.org/pod/XML::LibXML::Simple#empty-elements-are-not-removed.

AWS returns:

  <?xml version="1.0" encoding="UTF-8"?>
  <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>test-1357854246</Name>
    <Prefix></Prefix>
    <Marker></Marker>
Grinnz commented 6 years ago

I've started taking a look at how Paws could be converted to use a proper XML parser. The fundamental issue is that when doing it correctly, you do not deserialize the XML to a data structure at all but an object which you then query. Thus the method response_to_object in Paws::API::Caller would have to understand this. It currently appears to assume that unserialize_response will return a data structure. Either Paws::Net::XMLResponse and Paws::Net::RestXMLResponse would have to define a new method which returns the XML object, which Paws::API::Caller would then call and use based on some condition, or it would have to check whether unserialize_response returns an object rather than a basic data structure and handle it appropriately. What are your thoughts on the best way to do this in keeping with the existing structure? Mainly, response_to_object also has to handle JSON responses which deserialize cleanly to plain data structures.

If this fundamental issue can be worked out then it makes little difference whether XML::LibXML or Mojo::DOM58 or whatever other parser is used afterward. (And I don't consider XML::LibXML::Simple or XML::Fast or other xml-to-hash converters to be a real solution to this problem.)

Grinnz commented 6 years ago

Another kind of crazy idea: put JSON responses in an object too, using JSON pointers to query them. Unfortunately the implementation I'd want to use for that is Mojo::JSON::Pointer and it's a bit silly to depend on Mojolicious just for that. (JSON::Pointer seems super overcomplicated for this task)

Grinnz commented 6 years ago

A further problem: response_to_object primarily builds the response object by passing it the plain data structure it receives and relying on the constructor to build the object from there. If deserializing to an object instead of a plain data structure, it wouldn't be able to automatically build it this way; it needs to somehow query the object for each element it's looking for.

Grinnz commented 6 years ago

Perhaps this could be solved by a new method for XMLResponse and RestXMLResponse as well: new_from_result_object or so.

Grinnz commented 6 years ago

So it basically boils down to this: what would be the most appropriate way for response_to_object to determine that it needs to work with an XML object (from XMLResponse or RestXMLResponse) rather than a plain data structure (from JsonResponse or RestJsonResponse)?

pplu commented 6 years ago

Take a look at the code if the refactored code in this branc helps https://github.com/pplu/aws-sdk-perl/tree/release/0.37

Grinnz commented 6 years ago

Oh, on that branch each response type defines its own response_to_object, that should make things a lot easier.

pplu commented 6 years ago

🎉 🎈

slobo commented 4 years ago

We did some internal tests, and XML::Hash::XS seemed to work for all requests we are making (various S3 and SQS), and it brought massive speed improvements to some bulk SQS insertion jobs. NYTProf showed that Paws was responsible for about 30-50% of our run-time, using XML::Hash::XS made it so Paws accounts to only about 3-5%, basically doubling the speed of our batch.

And all we did was basically update Paws::Net::XMLResponse like this :

-    my $struct = eval { $self->_xml_parser->parse_string($response->content) };
+    my $struct = eval { xml2hash($response->content) };

Would you be interested if we prepared a PR? IIRC, XML::Simple already requires access to compiler for some of the deps, so using an XS module shouldn't be a burden? Are there any additional (unit) tests that you would require before feeling comfortable with such a change?

Thanks

Grinnz commented 4 years ago

XML::Hash::XS has the exact same design flaws as XML::Simple, and would require checking that edge cases are all handled the same way.

Grinnz commented 4 years ago

Sorry that I've failed to complete this effort for so long, I will hopefully find some time to get back to it.

slobo commented 4 years ago

For sure there could be a more elegant way to handle responses in the long term that can both be efficient and ensure correctness. Thanks for thinking about it and would be great to see what you come up with.

My immediate concern is that in the meantime we are leaving a lot of performance on the table by using XML::Simple, and it seems that a rather simple patch can be a big win.

There is of course the benefit of XML::Simple being battle tested, but hopefully the test suite will catch any issues before it hit anyone in production. AFAIK, Amazon responses are fairly simple, so maybe not a lot of danger in breaking anything?

Thanks.

pplu commented 4 years ago

In the past I tried changing the XML Parser (It's documented in this ticket) without much luck due to difference of handling edge cases, but at least the test suite catches these things.

@slobo @Grinnz : I'd be open to shipping with XML::Hash::XS if the tests pass, as the test suite should be quite thorough with XML responses and their edge cases.

Of course, if @Grinnz can build the proper solution I'd be even happier of getting that merged.

pplu commented 4 years ago

I've just opened the branch https://github.com/pplu/aws-sdk-perl/tree/xml2hash to see how tests look. I've substituted XML::Simple for XML::Hash::XS, and there are test failures. They basically seem related to the fact that Paws configures XML::Simple to undefine empty tags <xxx></xxx> instead of converting them to an empty string.

@slobo : please feel free to work on that branch to get everything going. I'm pretty sure that everything will revolve around getting XML::Hash::XS to produce the same hashes as XML::Simple with these options: https://github.com/pplu/aws-sdk-perl/blob/master/lib/Paws/Net/XMLResponse.pm#L10

byterock commented 4 years ago

I will see if I have some time I will try it aginst this

branch

https://github.com/byterock/aws-sdk-perl/tree/s3ObjectTagging

where all of the S3 actions are now working.

The only problem I found with the XML parsing was the S3 GetBucketLocation where the root node was dropped from the pasing

For notes check out these blog posts

http://blogs.perl.org/users/byterock/2019/10/paws-iv-the-sun-king.html http://blogs.perl.org/users/byterock/2019/10/paws-the-xv-still-some-way-to-go.html

By the way the 10_response test suite in that branch has test for all of the actions (excelpt a few depricated ones that no longer work) of S3.

slobo commented 4 years ago

Awesome, I've created #366 to explore this.