libxml-raku / LibXML-raku

Raku bindings to the libxml2 native library
Artistic License 2.0

Serializing parsed HTML produces CDATA for <style> #109

vrurg closed this issue 10 months ago

vrurg commented 10 months ago

This is best shown with an example:

my $html = q:to/HTML/;
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <style>hr,
img {
    box-sizing: content-box
}
</style>

</head>

<body class="pod">
    Test
</body>
</html>
HTML

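use LibXML;

# MAIN takes the path of the HTML file to parse.
sub MAIN(IO() $html) {
    my $xml = LibXML.parse: $html, :html, :huge, :recover(2);
    # Serialize the parsed document back to HTML.
    say $xml.Str(:html);
}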

The essential part of its output is the following:

    <style><![CDATA[hr,
img {
    box-sizing: content-box
}
]]></style>

Because of this CDATA section, no browser displays the processed page. A workaround is to manually locate the CDATA node, unbind it, and set the <style> text to the content of that node.

I suspect this is a problem on the libxml2 side, but it is better to report it anyway.

vrurg commented 10 months ago

OK, my bad: I forgot that the text gets HTML-encoded, which results in a broken stylesheet.

Now it looks like I'm out of options, because serialization into HTML happens without any participation of LibXML::Node, except for the initial call to the Str/serialize-html methods. The raw nodes are native objects, so there are no mixins that could override their Str.

I have an idea, but it is a cumbersome hack.

dwarring commented 10 months ago

I just changed Str(:html) to set the output option XML_SAVE_AS_HTML instead of XML_SAVE_XHTML. This stops libxml2 from writing CDATA. The output is now:

 <!DOCTYPE html>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><html lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="UTF-8">
    <style>hr,
img {
    box-sizing: content-box
}
</style>

</head>

<body class="pod">
    Test
</body>
</html>

This might be a better mapping of this option, if browsers don't handle CDATA.

vrurg commented 10 months ago

Great!

Can I do it without switching to HEAD? It looks like the only way is to call Str on raw.

dwarring commented 10 months ago

This should do it:

use LibXML::Enums;
say $xml.Str(:options(XML_SAVE_AS_HTML));

vrurg commented 10 months ago

I clearly overlooked the :$options parameter of output-options. It does the trick, thank you!

This can now be closed, considering https://github.com/libxml-raku/LibXML-raku/commit/468bc92e7cf353c6931b0238932d5327cb91269c.
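
For anyone landing here from a search, the pieces of this thread can be put together roughly as follows. This is only a sketch: it assumes that LibXML.parse accepts :string together with :html (the original report parsed a file instead), and it uses a shortened copy of the markup from the report. The Str(:options(XML_SAVE_AS_HTML)) call is the one suggested above.

```raku
use LibXML;
use LibXML::Document;
use LibXML::Enums;   # provides XML_SAVE_AS_HTML, as used earlier in this thread

# Shortened copy of the markup from the report (XML declaration omitted).
my $html = q:to/HTML/;
<!DOCTYPE html>
<html lang="en">
<head>
    <style>hr,
img {
    box-sizing: content-box
}
</style>
</head>
<body class="pod">
    Test
</body>
</html>
HTML

# Assumption: parsing from a string with :string; the report used a file path.
my LibXML::Document $doc = LibXML.parse: :string($html), :html;

# Request plain HTML serialization explicitly instead of relying on Str(:html),
# so that releases which still map :html to XML_SAVE_XHTML behave the same way.
my Str $out = $doc.Str(:options(XML_SAVE_AS_HTML));

say $out;
```

With a release that includes the commit linked above, Str(:html) should give the same result without the explicit option.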