Closed TechnikEmpire closed 8 years ago
There's a HTML fragment serialization algorithm spec for this, including escaping routines for attribute values. It's the spec upon which Element.innerHTML
and Element.outerHTML
are based.
@craigbarnes Thanks for the links, that's much better than the spec I've been using.
Attribute values can be escaped, but this serialize.cc was written only as an example to help visualize what the parsr was doing based on the actual gumbo-node tree, not on the underlying text unless the node tag as unknown.
Yes, assuming no changes are made to the gumbo tree after parsing, you could use the orig text to serialize things faster, but you would not be representing the actual node values in the gumbo tree.
@TechnikEmpire nah, I prefer yours.
@craigbarnes haha yeehaw. @kevinhendricks I realize it's an example I'm not trying to be critical. With your clarification of representing whats in the actual tree, it makes sense to work as it does. Anyway I'm just using it as a base for some backend stuff to a port of gumbo-query, doing lots of rewriting and noticing things, sharing them as they come up. Thanks for clarifying. I guess I should close this out?
What exactly is the issue with cnn.com? What char-ref's are missing. The spec for escaping attribute values calls for
but that makes no sense since named entities are not allowed in html5, so at best this should use a numeric entity. And does the serialize actually corrupt the cnn.com html given the char entities are properly converted to utf-8 chars when output and the values are xml entitiy escaped? If you could point me at the problem, that would help.
ps. the inline code is a remnant of a much more advanced version that exists in my own gumbo tree, that I forgot to remove ... sorry about that.
@kevinhendricks Let me get back to you, with my version I was aiming to have the serialize output be 1:1 with the input html string. If you save the serialized output and do a diff versus say view-source, you'll see that (at least on today's front page of CNN) a lot of character references are consumed and lost, because the values are serialized back out from ->value
instead of ->original_value
. I understand it is proper behavior for the character references to be consumed and converted, I was (erroneously) expecting the serialize output to be a full reconstruction of the original input.
I believe I was hasty with using words like "bug" and "incorrect" after you clarified your intended purpose for the example. I believe now my goal/expected output was outside the scope of intended use. Comments have been edited to remove stuff based on my incorrect assumptions. I'm thinking maybe I should just close this.
The
build_attributes
method in theserialize.cc
example improperly builds attribute names and values usingGumboAttribute*->name
andGumboAttribute*->value
. This fundamentally breaks the output HTML because all character references are consumed and replaced during parsing. The correct thing to do would be to copy the original values:Otherwise you get broken output. Can be observed by testing against cnn.com.
I'm curious if the
original_text
buffer can possibly have nested raw html text, because if so, serialization this way could lead to fudged output as well.