ohler55 / ox

Ruby Optimized XML Parser
http://www.ohler.com/ox
MIT License

Confused about `convert_special` and sax_html parsing #358

Closed by davetron5000 1 month ago

davetron5000 commented 1 month ago

Hi, sorry if this is not the best place for questions or support, but I guess it's possible my issue is a bug.

I'm trying to use Ox to parse HTML5, and I'm finding that it escapes `<` and `&` in attributes, text, and CDATA. I understand this is correct behavior for XML, so I set `convert_special`, but it doesn't have the effect I'm looking for:

<html>
<head>
<style>
  <![CDATA[
    .foo {
      content: ">";
    }
  ]]>
</style>
</head>
<body>
<h1>Hello</h1>
</body>
</html>

When I parse this using a handler class passed to `Ox.sax_html`, `text()` and `cdata()` are both given escaped strings, so if I try to recreate that `<style>` block, it shows `content: "&gt;";`.

So the question is: is this correct behavior and, if so, can it be controlled or disabled?

ohler55 commented 1 month ago

Looking at the code, attributes and text use the :convert_special option but CDATA does not. Can you provide the code (handler) that received the &gt; string?

davetron5000 commented 1 month ago

OK, in putting together a minimal example, I'm realizing the escaping does not happen in sax_parse. I was also building a document from the parsed values, and it's that step that was escaping them, which seems reasonable and consistent with the docs. Sorry for the bother, but thanks for being responsive!
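
For anyone landing here with the same symptom, the effect described above (raw strings coming out of the parser, escaping applied only when a document is written back out) can be illustrated without Ox at all. This is a minimal sketch using only Ruby's standard library. `CGI.escapeHTML` is an assumed stand-in for a serializer's escaping step, and `serialize_text` and `serialize_raw` are hypothetical helpers, not Ox APIs. Ox's own escaping rules may differ, for example in how it treats quotes.

```ruby
require "cgi"

# What a SAX-style handler would hand back for the <style> body:
# the raw, unescaped characters.
raw_css = 'content: ">";'

# Hypothetical stand-in for a serializer that entity-escapes text nodes
# when writing markup. CGI.escapeHTML is used purely for illustration.
def serialize_text(tag, text)
  "<#{tag}>#{CGI.escapeHTML(text)}</#{tag}>"
end

# Hypothetical alternative that writes the text through untouched,
# which is what a <style> body needs in HTML.
def serialize_raw(tag, text)
  "<#{tag}>#{text}</#{tag}>"
end

escaped  = serialize_text("style", raw_css)
verbatim = serialize_raw("style", raw_css)

puts escaped   # the ">" shows up as "&gt;" here
puts verbatim  # the ">" is preserved here
```

The parsed value is the same in both cases; only the write step differs. When debugging this kind of issue, printing the string inside the handler callback before it goes anywhere near a document builder is a quick way to tell which side introduced the entities.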