google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars 660 forks source link

Serialize.cc improperly builds text and attributes #355

Closed TechnikEmpire closed 8 years ago

TechnikEmpire commented 8 years ago

The build_attributes method in the serialize.cc example improperly builds attribute names and values using GumboAttribute*->name and GumboAttribute*->value. This fundamentally breaks the output HTML because all character references are consumed and replaced during parsing. The correct thing to do would be to copy the original values:

atts.append(at->original_name.data, at->original_name.length);
atts.append(at->original_value.data, at->original_value.length);
std::string(child->v.text.original_text.data, child->v.text.original_text.length);

Otherwise you get broken output. Can be observed by testing against cnn.com.

I'm curious if the original_text buffer can possibly have nested raw html text, because if so, serialization this way could lead to fudged output as well.

craigbarnes commented 8 years ago

There's a HTML fragment serialization algorithm spec for this, including escaping routines for attribute values. It's the spec upon which Element.innerHTML and Element.outerHTML are based.

TechnikEmpire commented 8 years ago

@craigbarnes Thanks for the links, that's much better than the spec I've been using.

kevinhendricks commented 8 years ago

Attribute values can be escaped, but this serialize.cc was written only as an example to help visualize what the parsr was doing based on the actual gumbo-node tree, not on the underlying text unless the node tag as unknown.

Yes, assuming no changes are made to the gumbo tree after parsing, you could use the orig text to serialize things faster, but you would not be representing the actual node values in the gumbo tree.

craigbarnes commented 8 years ago

@TechnikEmpire nah, I prefer yours.

TechnikEmpire commented 8 years ago

@craigbarnes haha yeehaw. @kevinhendricks I realize it's an example I'm not trying to be critical. With your clarification of representing whats in the actual tree, it makes sense to work as it does. Anyway I'm just using it as a base for some backend stuff to a port of gumbo-query, doing lots of rewriting and noticing things, sharing them as they come up. Thanks for clarifying. I guess I should close this out?

kevinhendricks commented 8 years ago

What exactly is the issue with cnn.com? What char-ref's are missing. The spec for escaping attribute values calls for

  

but that makes no sense since named entities are not allowed in html5, so at best this should use a numeric entity. And does the serialize actually corrupt the cnn.com html given the char entities are properly converted to utf-8 chars when output and the values are xml entitiy escaped? If you could point me at the problem, that would help.

ps. the inline code is a remnant of a much more advanced version that exists in my own gumbo tree, that I forgot to remove ... sorry about that.

TechnikEmpire commented 8 years ago

@kevinhendricks Let me get back to you, with my version I was aiming to have the serialize output be 1:1 with the input html string. If you save the serialized output and do a diff versus say view-source, you'll see that (at least on today's front page of CNN) a lot of character references are consumed and lost, because the values are serialized back out from ->value instead of ->original_value. I understand it is proper behavior for the character references to be consumed and converted, I was (erroneously) expecting the serialize output to be a full reconstruction of the original input.

I believe I was hasty with using words like "bug" and "incorrect" after you clarified your intended purpose for the example. I believe now my goal/expected output was outside the scope of intended use. Comments have been edited to remove stuff based on my incorrect assumptions. I'm thinking maybe I should just close this.