ohler55 / ox

Ruby Optimized XML Parser
http://www.ohler.com/ox
MIT License
904 stars 76 forks

a new mode like :hash_no_attrs but with included attributes #347

Open xkwd opened 1 year ago

xkwd commented 1 year ago

Hello Peter, thank you so much for the very efficient Ox and Oj gems!

I am trying to replace Savon (which uses Nokogiri for XML parsing) with Ox in multiple heavily loaded microservices for performance reasons. Below are the results of parsing a small fragment of one XML document with both the :hash_no_attrs and :hash modes:

# :hash_no_attrs
{
  nodes: [
    {
      services: [
        { service_id: { id: '100', status_id: '400' }, update_id: '500' },
        { service_id: { id: '200', status_id: '400' }, update_id: '500' },
        { service_id: { id: '300', status_id: '400' }, update_id: '500' },
      ],
      id: '82383838383838',
      nodes: { id: '8888888' },
      quantities: [
        { size: '122', id: { code: '900', node: '5' } },
        { size: '103', id: { code: '900', node: '10' } },
        { size: '92', id: { code: '900', node: '20' } }
      ],  
      time: '2023-10-20T05:05:00.000+01:00',
      type: {
        id: '9000',
        mode: { id: '2828288', protocol: '7000' },
      },
      informations: ['2', '17', '64', '1157', '1604', '100008']
    },
    {
      services: [
        { service_id: { id: '400', status_id: '500' }, update_id: '600' },
        { service_id: { id: '500', status_id: '500' }, update_id: '700' },
        { service_id: { id: '600', status_id: '500' }, update_id: '700' }
      ],
      id: '92829292992',
    }
  ]
}
# :hash
{
  nodes: [
    [
      {
        "xmlns:ns4": 'http://Model/Status/Protocol/',
        "xmlns:xsi": 'http://www.w3.org/2001/XMLSchema-instance',
        "xsi:type": 'ns4:ServiceProtocol'
      },
      { services: { service_id: { id: '100', status_id: '400' }, update_id: '500' } },
      { services: { service_id: { id: '200', status_id: '400' }, update_id: '500' } },
      { services: { service_id: { id: '300', status_id: '400' }, update_id: '500' } },
      { id: '82383838383838' },
      { nodes: { id: '8888888' } },
      { quantities: { size: '122', id: { code: '900', node: '5' } } },
      { quantities: { size: '103', id: { code: '900', node: '10' } } },
      { quantities: { size: '92', id: { code: '900', node: '20' } } },
      { time: '2023-10-20T05:05:00.000+01:00' },
      {
        type: {
          id: '9000',
          mode: { id: '2828288', protocol: '7000' }
        }
      },
      { informations: '2' },
      { informations: '17' },
      { informations: '64' },
      { informations: '1157' },
      { informations: '1604' },
      { informations: '100008' }
    ],
    [
      {
        "xmlns:ns4": 'http://Model/Status/Protocol/',
        "xmlns:xsi": 'http://www.w3.org/2001/XMLSchema-instance',
        "xsi:type": 'ns4:ServiceShop'
      },
      { services: { service_id: { id: '400', status_id: '500' }, update_id: '600' } },
      { services: { service_id: { id: '500', status_id: '500' }, update_id: '700' } },
      { services: { service_id: { id: '600', status_id: '500' }, update_id: '700' } },
      { id: '92829292992' }
    ]
  ]
}

The :hash_no_attrs mode gives the most desirable output to work with (it is a hash), but unfortunately it can't be used because the attributes are missing. The :hash mode includes the missing attributes, but its output structure is significantly different from the :hash_no_attrs one: an element with attributes becomes an array instead of a hash.

Mapping an API response to internal models is much simpler when accessing a hash by known keys than when iterating over an array and looking for matching elements, especially when dealing with thousands of lines, where every millisecond matters. The array could be transformed into a hash after the initial parsing, but that would undermine the performance gains from using Ox.
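For illustration, the post-parse transformation mentioned above could look roughly like the sketch below. It is plain Ruby working on the :hash output shown earlier, not part of Ox. The helper name `collapse_element` is hypothetical, and it assumes the first entry of the array is the attribute hash, which is how the sample output above is laid out.

```ruby
# Collapse one element as shown in the :hash sample above, i.e.
# [attrs_hash, {child: value}, {child: value}, ...], into a single hash.
# Assumption: the first entry is always the attribute hash.
def collapse_element(entries)
  attrs, *children = entries

  # Group child values by element name, keeping their relative order.
  grouped = Hash.new { |hash, key| hash[key] = [] }
  children.each do |child|
    child.each { |key, value| grouped[key] << value }
  end

  # A name seen once keeps its bare value; a repeated name becomes an array.
  result = grouped.each_with_object({}) do |(key, values), merged|
    merged[key] = values.size == 1 ? values.first : values
  end

  # Attributes are stored next to the children under "@"-prefixed keys.
  attrs.each { |name, value| result[:"@#{name}"] = value }
  result
end

# A trimmed copy of the second <nodes> element from the :hash sample.
parsed = {
  nodes: [
    [
      { 'xsi:type': 'ns4:ServiceShop' },
      { services: { service_id: { id: '400', status_id: '500' }, update_id: '600' } },
      { services: { service_id: { id: '500', status_id: '500' }, update_id: '700' } },
      { id: '92829292992' }
    ]
  ]
}

merged = { nodes: parsed[:nodes].map { |entries| collapse_element(entries) } }
```

The helper handles one level only. An array of repeated elements and an array produced for an element with attributes are both arrays of hashes, so a generic recursive version would need extra knowledge of the schema to tell them apart. This extra pass over the data is also the cost the paragraph above is worried about.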

I realize that you have already mentioned in other issues that the two modes, :hash_no_attrs and :hash, are enough for most cases, but I would really appreciate it if you could consider adding another mode that is identical to :hash_no_attrs in its output structure but includes attributes as hash elements (instead of an extra hash with attributes, as in the :hash mode). Please see the example below:

# :hash_no_attrs format + attributes
{
  nodes: [
    {
      services: [
        { service_id: { id: '100', status_id: '400' }, update_id: '500' },
        { service_id: { id: '200', status_id: '400' }, update_id: '500' },
        { service_id: { id: '300', status_id: '400' }, update_id: '500' },
      ],
      ...
      "@xmlns:ns4": 'http://Model/Status/Protocol/',
      "@xmlns:xsi": 'http://www.w3.org/2001/XMLSchema-instance',
      "@xsi:type": 'ns4:ServiceProtocol',
    },
    {
      services: [
        { service_id: { id: '400', status_id: '500' }, update_id: '600' },
        { service_id: { id: '500', status_id: '500' }, update_id: '700' },
        { service_id: { id: '600', status_id: '500' }, update_id: '700' }
      ],
      ...
      "@xmlns:ns4": 'http://Model/Status/Protocol/',
      "@xmlns:xsi": 'http://www.w3.org/2001/XMLSchema-instance',
      "@xsi:type": 'ns4:ServiceShop',
    }
  ]
}

Thank you 🙇🏻

ohler55 commented 1 year ago

Have you considered using the SAX parser?

Post the actual XML too for further discussion.

xkwd commented 1 year ago

Thank you for such a prompt reply 🙂

> Have you considered using the SAX parser?

I had been thinking of testing it, but was afraid that it could be slower than Ox.load.

Below is the XML:

<nodes xmlns:ns4="http://Model/Status/Protocol/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns4:ServiceProtocol">
  <services>
    <serviceId>
      <id>100</id>
      <statusId>400</statusId>
    </serviceId>
    <updateId>500</updateId>
  </services>
  <services>
    <serviceId>
      <id>200</id>
      <statusId>400</statusId>
    </serviceId>
    <updateId>500</updateId>
  </services>
  <services>
    <serviceId>
      <id>300</id>
      <statusId>400</statusId>
    </serviceId>
    <updateId>500</updateId>
  </services>
  <id>82383838383838</id>
  <nodes>
    <id>8888888</id>
  </nodes>
  <quantities>
    <size>122</size>
    <id>
      <code>900</code>
      <node>5</node>
    </id>
  </quantities>
  <quantities>
    <size>103</size>
    <id>
      <code>900</code>
      <node>10</node>
    </id>
  </quantities>
  <quantities>
    <size>92</size>
    <id>
      <code>900</code>
      <node>20</node>
    </id>
  </quantities>
  <time>2023-10-20T05:05:00.000+01:00</time>
  <type>
    <id>9000</id>
    <mode>
      <id>2828288</id>
      <protocol>7000</protocol>
    </mode>
  </type>
  <informations>2</informations>
  <informations>17</informations>
  <informations>64</informations>
  <informations>1157</informations>
  <informations>1604</informations>
  <informations>100008</informations>
</nodes>
<nodes xmlns:ns4="http://Model/Status/Protocol/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns4:ServiceShop">
  <services>
    <serviceId>
      <id>400</id>
      <statusId>500</statusId>
    </serviceId>
    <updateId>600</updateId>
  </services>
  <services>
    <serviceId>
      <id>500</id>
      <statusId>500</statusId>
    </serviceId>
    <updateId>700</updateId>
  </services>
  <services>
    <serviceId>
      <id>600</id>
      <statusId>500</statusId>
    </serviceId>
    <updateId>700</updateId>
  </services>
  <id>92829292992</id>
</nodes>

ohler55 commented 1 year ago

The nice thing about the SAX parser is that you can ignore stuff you don't need. I don't know if that applies in your case though.

Thanks for the XML.

It looks like the only attributes are in the nodes element.

xkwd commented 1 year ago

> The nice thing about the SAX parser is that you can ignore stuff you don't need. I don't know if that applies in your case though.

Oh, I have already learnt that ignoring elements is not an option, because certain parts of the original XML have to be reused and therefore must not be altered in any way. This is actually the main issue I am dealing with right now: how to efficiently parse a very large XML document, with all its elements and attributes, into a hash format for easy mapping with Ruby.

> It looks like the only attributes are in the nodes element.

Yes, and when I send back this XML (and many other ones) without attributes to an API, I get a validation error.

ohler55 commented 1 year ago

So, if I can summarize, you are looking for the hash format but using a map instead of a list and then merging the elements. That would lose the information about the order. If that is not important, it might be possible. Can I ask you to try the SAX parser first? Then I'll see how alternate formats might be supported.
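A small sketch of what "losing the order" means when child elements are merged into a map. The element names are borrowed from the XML above; the `merge_children` helper and the sample values are made up for illustration and are not Ox code.

```ruby
# Merge [name, value] pairs (children in document order) into a hash,
# collecting the values of each name into an array.
def merge_children(pairs)
  pairs.each_with_object({}) do |(name, value), merged|
    (merged[name] ||= []) << value
  end
end

# Two documents whose children are interleaved differently.
interleaved = [[:services, 'A'], [:id, '1'], [:services, 'B']]
grouped     = [[:services, 'A'], [:services, 'B'], [:id, '1']]

same = merge_children(interleaved) == merge_children(grouped)
# same is true: both merge to { services: ['A', 'B'], id: ['1'] },
# so the original interleaving cannot be recovered from the hash.
```

The order among values of the same name survives; only the order between differently named siblings is lost.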

xkwd commented 1 year ago

> So, if I can summarize, you are looking for the hash format but using a map instead of a list and then merging the elements.

Sounds correct, and if I understand correctly, the "using a map instead of a list and then merging the elements" part is already implemented in the :hash_no_attrs mode.

> That would lose the information about the order. If that is not important, it might be possible.

I guess when dealing with a hash format, the order is not that important.

> Can I ask you to try the SAX parser first

Sure, I will give it a try and will get back with my findings 👍🏻

xkwd commented 1 year ago

Hello again. I have just released a tiny wrapper for the SAX parser called OXML, which I have already tested on multiple applications. It solves the issue of missing attributes and is at least ~2.5-4x faster than Savon with its built-in Nori gem. However, with Ox.load I am able to achieve an extra ~5-10x performance increase on top of the SAX parser, depending on whether the typecasting option is used when parsing with SAX. Therefore, I am planning to use Ox.load for applications where XML attributes are not used, and would be extremely grateful for another mode with attributes included, so that the slower SAX parser would only be needed as a fallback for cases where typecasting is required 🙂
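As a rough idea of how a SAX handler can build a hash with attributes included, here is a minimal sketch. It is not the OXML implementation. The class name `AttrHashHandler` is hypothetical, the "@" prefix follows the format requested earlier in this thread, and typecasting is left out. The callback names (`start_element`, `attr`, `text`, `end_element`) are the ones Ox's SAX parser uses. Subclassing `Ox::Sax` is the documented route; a plain object is used here on the assumption that Ox only needs the handler to respond to these public methods.

```ruby
# Builds a nested hash from SAX events; attributes go under "@name" keys.
class AttrHashHandler
  attr_reader :result

  def initialize
    @result = {}
    @stack = [@result] # hashes of the currently open elements
    @texts = [nil]     # last text seen inside each open element
  end

  def start_element(_name)
    @stack.push({})
    @texts.push(nil)
  end

  def attr(name, value)
    @stack.last[:"@#{name}"] = value
  end

  def text(value)
    @texts[-1] = value
  end

  def end_element(name)
    node = @stack.pop
    text = @texts.pop
    # A leaf without attributes becomes its text; anything else stays a hash.
    value = node.empty? ? text : node
    parent = @stack.last
    if parent.key?(name)
      # Repeated element name: collect the values into an array.
      parent[name] = [parent[name]] unless parent[name].is_a?(Array)
      parent[name] << value
    else
      parent[name] = value
    end
  end
end

handler = AttrHashHandler.new
# With the ox gem installed, these events would come from something like
#   Ox.sax_parse(handler, StringIO.new(xml))
# Here they are fired by hand for:
#   <nodes xsi:type="ns4:ServiceShop"><id>1</id><id>2</id></nodes>
handler.start_element(:nodes)
handler.attr(:'xsi:type', 'ns4:ServiceShop')
handler.start_element(:id)
handler.text('1')
handler.end_element(:id)
handler.start_element(:id)
handler.text('2')
handler.end_element(:id)
handler.end_element(:nodes)

parsed = handler.result
```

The sketch drops text that appears alongside attributes or child elements, which would need a dedicated key in a real handler.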

ohler55 commented 1 year ago

Super! The wrapper looks good. Nice that the performance is that much better as well.

Uelb commented 7 months ago

I also needed that feature on my side and I took a different approach (not sure if it's the best, but I'll share it anyway).

I use libxslt and its command-line tool xsltproc to transform the XML, turning attributes into child elements, with this stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" />

  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:for-each select="@*">
        <xsl:element name="{name()}">
          <xsl:value-of select="." />
        </xsl:element>
      </xsl:for-each>
      <xsl:apply-templates
        select="*|text()" />
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

I then feed the result to Ox.load like this to get the intended result.

Ox.load(`xsltproc #{xsl_path} #{xml_path}`, mode: :hash)

xsltproc is quite fast, even though it would probably be faster to generate the correct result directly with an additional mode in the C codebase. If I need it to be faster later, I'll try and implement such a mode.
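If the file paths can contain spaces or other shell-sensitive characters, one option is to call the tool without a shell using Ruby's standard `Open3.capture2`. This is only a sketch: the helper name `run_xslt` and its `command:` parameter are hypothetical, and the parameter exists only so the helper can be exercised without xsltproc installed.

```ruby
require 'open3'

# Run an XSLT processor on xml_path with the stylesheet xsl_path and return
# its standard output. Arguments are passed as a list, so no shell is involved.
def run_xslt(xsl_path, xml_path, command: ['xsltproc'])
  output, status = Open3.capture2(*command, xsl_path, xml_path)
  raise "#{command.first} failed with status #{status.exitstatus}" unless status.success?

  output
end
```

The returned string can then be passed to Ox.load in place of the backtick expression, for example `Ox.load(run_xslt(xsl_path, xml_path), mode: :hash)`.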