rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0

Serialization options for html #92

Closed stevecheckoway closed 6 years ago

stevecheckoway commented 6 years ago

A few days ago, I wrote an implementation for serializing HTML5 according to the "Serializing HTML fragments" section of the HTML standard. I need to write some more tests, but there are a few design choices that I'd like to lay out.

  1. How should we integrate with Nokogiri's serialization API?
  2. What should we do about pre, listing, and textarea?

Integration with Nokogiri's serialization API

Nokogiri has a bewildering number of different interfaces for serializing a Nokogiri::XML::Node, each of which has several forms for arguments:

There are also private methods

Except for some broken versions of libxml2, everything eventually calls #native_write_to. The key save_with option to control formatting is XML::Node::SaveOptions::FORMAT (corresponding to XML_SAVE_FORMAT).

In table form:

| Method | Calls | Sets default SaveOptions | Ultimate default SaveOptions |
| --- | --- | --- | --- |
| #inner_html | #to_html | | AS_HTML\|FORMAT |
| #serialize | #write_to | | FORMAT |
| #to_html | #to_format | AS_HTML\|FORMAT | AS_HTML\|FORMAT |
| #to_s | #to_html | | AS_HTML\|FORMAT |
| #write_html_to | #write_format_to | AS_HTML\|FORMAT | AS_HTML\|FORMAT |
| #write_to | #native_write_to | FORMAT | FORMAT |
| #to_format | #serialize | | FORMAT |
| #write_format_to | #write_to | | FORMAT |
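
To make the "ultimate default" column easier to check, here is a minimal sketch that models the table as plain Ruby data and walks each call chain until some method supplies a default. It is a toy model only: the `CHAIN` constant and the `ultimate_default` helper are hypothetical names made up for this illustration, and nothing here calls Nokogiri itself.

```ruby
# Toy model of the call chain in the table; not Nokogiri's real code.
# Each entry records which method is called next and which default
# SaveOptions (if any) the method itself supplies.
CHAIN = {
  "#inner_html"      => { calls: "#to_html",         sets: nil },
  "#serialize"       => { calls: "#write_to",        sets: nil },
  "#to_html"         => { calls: "#to_format",       sets: %i[AS_HTML FORMAT] },
  "#to_s"            => { calls: "#to_html",         sets: nil },
  "#write_html_to"   => { calls: "#write_format_to", sets: %i[AS_HTML FORMAT] },
  "#write_to"        => { calls: "#native_write_to", sets: %i[FORMAT] },
  "#to_format"       => { calls: "#serialize",       sets: nil },
  "#write_format_to" => { calls: "#write_to",        sets: nil },
}.freeze

# Follow the chain from method_name until a method supplies a default.
# Returns nil for methods outside the table (such as #native_write_to).
def ultimate_default(method_name)
  entry = CHAIN[method_name]
  return nil if entry.nil?
  entry[:sets] || ultimate_default(entry[:calls])
end

CHAIN.each_key do |name|
  puts format("%-17s %s", name, ultimate_default(name).join("|"))
end
```

Every chain in the table ends at a method that sets FORMAT, which is why no entry point avoids it by default.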

As long as neither AS_XML nor AS_XHTML is set and the node's document is an HTML_DOCUMENT_NODE, the output will be written as HTML.

When output as HTML, the only thing I can see FORMAT controlling is whether newlines are added after (some, but not all) elements. The questions are: where do we want to modify XML::Node to perform HTML5 serialization, and do we want to preserve this FORMAT default?

In particular, #inner_html should probably follow the standard for serialization by default, which means no additional newlines.

#write_to is the natural place to patch, but as you can see from the table, #to_html and #write_html_to both add FORMAT.

pre, listing, and textarea ignore leading newlines!

The parsing rules for these elements say that if the token following their start tag is a line feed, then the line feed is ignored "as an authoring convenience."

As an informative example in the standard, after parsing

<pre>

Hello.</pre>

and then serializing and reparsing, the pre element's text content is "Hello." with no leading newline.
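
The loss can be reproduced with a toy model of the rule. This is only a sketch under simplifying assumptions: `parse_text` and `serialize` are hypothetical helpers written for this illustration, they handle a single element containing plain text, and they skip the escaping a real serializer performs.

```ruby
# Toy model of the rule for pre, listing, and textarea; not a real parser.
ELEMENTS = %w[pre listing textarea].freeze

# "Parse": return the element's text content, ignoring one line feed
# immediately after the start tag.
def parse_text(html)
  m = html.match(/\A<(#{ELEMENTS.join("|")})>(.*)<\/\1>\z/m)
  raise ArgumentError, "unsupported input" unless m
  m[2].sub(/\A\n/, "")
end

# "Serialize": emit the text content verbatim (escaping omitted).
def serialize(tag, text)
  "<#{tag}>#{text}</#{tag}>"
end

first  = parse_text("<pre>\n\nHello.</pre>")   # one of two newlines dropped
second = parse_text(serialize("pre", first))   # the remaining newline dropped
p first
p second
```

Each parse drops one leading line feed, so the first parse yields a text node that still starts with a newline and the second parse yields one that does not.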

I'm inclined to follow this behavior for #inner_html but it's likely surprising to clients that well-formed HTML doesn't round-trip through serialization and reparsing.

What should we do here? Should some of these follow the standard? Should we introduce a new API?

rubys commented 6 years ago

"round trip" in this context means "produces the same DOM". Not that I can use &amp;, &#38;, and &#x26; and get the same serialization.

+1 to default to the standard. In lieu of new APIs, perhaps additional options can be added?

stevecheckoway commented 6 years ago

Right, round trip meaning getting the same DOM. Which doesn't happen in a bunch of cases, the most egregious of which is line feeds following pre, listing, and textarea. Most of the others have to do with invalid markup or the DOM being modified (e.g., by JavaScript).

Adding a new option to preserve semantically meaningful newlines sounds good. E.g.,

<pre>
Hi</pre>

still gets serialized as "<pre>Hi</pre>" but

<pre>

Hi</pre>

gets serialized as "<pre>\n\nHi</pre>".
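
A minimal sketch of how such a serializing option could behave, assuming the text content is already parsed per the standard. The `parse_pre` and `serialize_pre` helpers and the `keep_leading_newline` keyword are hypothetical names for this illustration, not the option as implemented.

```ruby
# Toy model; handles a single pre element containing plain text.

# Parse per the standard: ignore one line feed right after the start tag.
def parse_pre(html)
  body = html[/\A<pre>(.*)<\/pre>\z/m, 1]
  raise ArgumentError, "expected a single pre element" if body.nil?
  body.sub(/\A\n/, "")
end

# Serialize the text content. With keep_leading_newline, emit one extra
# line feed when the text itself starts with a line feed, so that the
# parser's "ignore one line feed" rule leaves the text intact on reparse.
def serialize_pre(text, keep_leading_newline: false)
  extra = keep_leading_newline && text.start_with?("\n") ? "\n" : ""
  "<pre>#{extra}#{text}</pre>"
end

["<pre>\nHi</pre>", "<pre>\n\nHi</pre>"].each do |source|
  text = parse_pre(source)
  out  = serialize_pre(text, keep_leading_newline: true)
  puts "#{source.inspect} -> #{out.inspect} (round trips: #{parse_pre(out) == text})"
end
```

The extra line feed is only emitted when the text content begins with one, so the common case still serializes without any added newline.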

Or did you mean an option on the parsing side of things? My preference is to parse per the standard and add a serializing option.

stevecheckoway commented 6 years ago

I still need to write some API tests, but here's what I have so far: https://github.com/rubys/nokogumbo/compare/master...stevecheckoway:serialize. By default, all of the various serialization methods give a standards-compliant serialization.

Those that end up passing their options to #write_to (which, off-hand, I think is all of them but #to_s, which doesn't take options) can pass trailing_nl: true to get a trailing newline on some elements (to be implemented) and preserve_pre: true to output a semantically equivalent pre, listing, and textarea (not tested).

Naming suggestions for those two options welcome.

stevecheckoway commented 6 years ago

I opted not to bother implementing the addition of newlines after some elements and went with preserve_newline: true for preserving multiple newlines at the start of pre, listing, and textarea.