Closed stevecheckoway closed 6 years ago
"round trip" in this context means "produces the same DOM". Not that I can use &
, &, and & and get the same serialization.
+1 to default to the standard. In lieu of new APIs, perhaps additional options can be added?
Right, round trip meaning getting the same DOM. Which doesn't happen in a bunch of cases, the most egregious of which is line feeds following `pre`, `listing`, and `textarea`. Most of the others have to do with invalid markup or the DOM being modified (e.g., by JavaScript).
Adding a new option to preserve semantically meaningful newlines sounds good. E.g., `<pre>\nHi</pre>` still gets serialized as `"<pre>Hi</pre>"`, but `<pre>\n\nHi</pre>` gets serialized as `"<pre>\n\nHi</pre>"`.
Or did you mean an option on the parsing side of things? My preference is to parse per the standard and add a serializing option.
I still need to write some API tests, but here's what I have so far https://github.com/rubys/nokogumbo/compare/master...stevecheckoway:serialize. By default, all of the various serialization methods give a standards-compliant serialization.
Those that end up passing their options to `#write_to` (which, off-hand, I think is all of them but `#to_s`, which doesn't take options) can pass `trailing_nl: true` to get a trailing newline on some elements (to be implemented) and `preserve_pre: true` to output a semantically equivalent `pre`, `listing`, and `textarea` (not tested).
Naming suggestions for those two options welcome.
I opted not to bother implementing the addition of newlines after some elements and went with `preserve_newline: true` for preserving multiple newlines at the start of `pre`, `listing`, and `textarea`.
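A minimal sketch of what `preserve_newline: true` could mean for these three elements. This is not the code from the branch linked above: `serialize_text_element` is a hypothetical helper that handles only an element with a single text child, and it escapes just `&`, `<`, and `>`.

```ruby
# Hypothetical helper (not part of Nokogiri or Nokogumbo): serialize an
# element whose only child is a text node.
def serialize_text_element(name, text, preserve_newline: false)
  # Elements for which the HTML parser drops one line feed after the start tag.
  newline_eating = %w[pre listing textarea]

  escaped = text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')

  # The parser will swallow one leading "\n" on reparse, so emit an extra one
  # when the text itself starts with a newline and the caller asked for it.
  extra = ''
  if preserve_newline && newline_eating.include?(name) && text.start_with?("\n")
    extra = "\n"
  end

  "<#{name}>#{extra}#{escaped}</#{name}>"
end

serialize_text_element('pre', 'Hi')                           # "<pre>Hi</pre>"
serialize_text_element('pre', "\nHi")                         # "<pre>\nHi</pre>" (per the standard)
serialize_text_element('pre', "\nHi", preserve_newline: true) # "<pre>\n\nHi</pre>"
```

With the option off, the output follows the standard and loses the leading newline on reparse; with it on, the reparsed text matches the original text.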
A few days ago, I wrote an implementation for serializing HTML5 according to Serializing HTML fragments. I need to write some more tests, but there are a few design choices that I'd like to lay out.
`pre`, `listing`, and `textarea`?
Integration with Nokogiri's serialization API
Nokogiri has a bewildering number of different interfaces for serializing a `Nokogiri::XML::Node`, each of which has several forms for arguments:
- `#inner_html(options = {})` calls `to_html(options)` on each child and joins the results
- `#serialize(options, &block)` (alternatively `#serialize(encoding = nil, save_with = nil, &block)`, which is equivalent to passing an options hash/keyword arguments with keys `:encoding` and `:save_with`) which calls `write_to io, options, &block` where `io` is a new `StringIO`
- `#to_html(options = {})` which calls `to_format SaveOptions::DEFAULT_HTML, options`
- `#to_s` which (for non-XML documents) calls `to_html`
- `#write_html_to(io, options = {})` which calls `write_format_to SaveOptions::DEFAULT_HTML, io, options`
- `#write_to(io, options, &block)` (alternatively `#write_to(io, encoding, save_with, &block)`) which yields `options[:save_with]` (defaulting to `XML::Node::SaveOptions::FORMAT`) and then calls `native_write_to(io, encoding, indent_string, config.options)` where `indent_string` doesn't matter for HTML and `config` is the `SaveOptions`
There are also private methods:
- `#to_format(save_option, options)` which calls `serialize(options)` with `options[:save_with] = save_option` unless `options[:save_with]` already exists; old versions of libxml2 cause this to call `dump_html` instead
- `#write_format_to(save_options, io, options)` which calls `write_to io, options` with `options[:save_with] = save_option` unless `options[:save_with]` already exists; old versions of libxml2 cause this to call `dump_html` instead
- `#dump_html` calls `htmlNodeDump` from `libxml2`
- `#native_write_to(io, encoding, indent_string, config_options)` calls `xmlSaveToIO` from `libxml2`
Except for some broken versions of libxml2, everything eventually calls `#native_write_to`. The key `save_with` option to control formatting is `XML::Node::SaveOptions::FORMAT` (corresponding to `XML_SAVE_FORMAT`).
In table form:
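| Method | Calls | `SaveOptions` set | Default `SaveOptions` |
|---|---|---|---|
| `#inner_html` | `#to_html` | | `AS_HTML\|FORMAT` |
| `#serialize` | `#write_to` | | `FORMAT` |
| `#to_html` | `#to_format` | `AS_HTML\|FORMAT` | `AS_HTML\|FORMAT` |
| `#to_s` | `#to_html` | | `AS_HTML\|FORMAT` |
| `#write_html_to` | `#write_format_to` | `AS_HTML\|FORMAT` | `AS_HTML\|FORMAT` |
| `#write_to` | `#native_write_to` | `FORMAT` | `FORMAT` |
| `#to_format` | `#serialize` | | `FORMAT` |
| `#write_format_to` | `#write_to` | | `FORMAT` |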
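The call chain described above can be traced with a toy model. The sketch below is not Nokogiri code: `ToyNode` and this `SaveOptions` module are stand-ins invented for illustration, the flag values are arbitrary distinct bits, `DEFAULT_HTML` is reduced to the two flags the table tracks, the block handling in `#write_to` is left out, and `native_write_to` writes the names of the flags it received instead of calling libxml2.

```ruby
require 'stringio'

# Stand-in for the save option flags; only these two matter for the trace.
module SaveOptions
  FORMAT       = 1
  AS_HTML      = 64
  DEFAULT_HTML = AS_HTML | FORMAT # simplified for this sketch
end

# Toy node mirroring how each public method forwards to the next one.
class ToyNode
  def initialize(children = [])
    @children = children
  end

  def inner_html(options = {})
    @children.map { |child| child.to_html(options) }.join
  end

  def serialize(options = {})
    io = StringIO.new
    write_to(io, options)
    io.string
  end

  def to_html(options = {})
    to_format(SaveOptions::DEFAULT_HTML, options)
  end

  def to_s
    to_html
  end

  def write_html_to(io, options = {})
    write_format_to(SaveOptions::DEFAULT_HTML, io, options)
  end

  def write_to(io, options = {})
    save_with = options[:save_with] || SaveOptions::FORMAT
    native_write_to(io, options[:encoding], '  ', save_with)
  end

  private

  # Fill in :save_with only when the caller did not supply one.
  def to_format(save_option, options)
    serialize(options.merge(save_with: options[:save_with] || save_option))
  end

  def write_format_to(save_option, io, options)
    write_to(io, options.merge(save_with: options[:save_with] || save_option))
  end

  # Instead of serializing, report which flags reached the bottom of the chain.
  def native_write_to(io, _encoding, _indent_string, flags)
    names = []
    names << 'AS_HTML' if (flags & SaveOptions::AS_HTML) != 0
    names << 'FORMAT' if (flags & SaveOptions::FORMAT) != 0
    io << "[#{names.join('|')}]"
  end
end

node = ToyNode.new([ToyNode.new])
node.to_html    # "[AS_HTML|FORMAT]"
node.serialize  # "[FORMAT]"
node.inner_html # "[AS_HTML|FORMAT]"
node.to_html(save_with: SaveOptions::AS_HTML) # "[AS_HTML]"
```

The last call shows that a caller-supplied `:save_with` replaces the default outright, so dropping `FORMAT` is already possible per call; the open question is only what the defaults should be.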
As long as neither `AS_XML` nor `AS_XHTML` is set and a node's document is an `HTML_DOCUMENT_NODE`, the output will be written as HTML.
When output as HTML, the only thing I can see `FORMAT` controlling is whether newlines are added after (some, but not all) elements. The questions are: where do we want to modify `XML::Node` to perform HTML5 serialization, and do we want to preserve this `FORMAT` default?
In particular, `#inner_html` should probably follow the standard for serialization by default, which means no additional newlines. `#write_to` is the natural place to patch but, as you can see from the table, `#to_html` and `#write_html_to` both add `FORMAT`.
`pre`, `listing`, and `textarea` ignore leading newlines!
The parsing rules for these elements say that if the token following their start tag is a line feed, then the line feed is ignored "as an authoring convenience."
As an informative example in the standard, after parsing `<pre>\n\nHello.</pre>` and then serializing and reparsing, the `pre` element's text content is `Hello.`.
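The loss can be reproduced with a toy model. The sketch below assumes a `pre` element containing only plain text; `parse_pre` and `serialize_pre` are hypothetical stand-ins for the parser and the standard serializer, not Nokogiri or Nokogumbo calls.

```ruby
# Parser side: a line feed immediately after the <pre> start tag is dropped.
def parse_pre(html)
  body = html[/\A<pre>(.*)<\/pre>\z/m, 1]
  raise ArgumentError, 'expected a single <pre> element' if body.nil?

  body.sub(/\A\n/, '')
end

# Serializer side, per the standard: the text is written as-is, with no
# extra line feed to compensate for the one the parser will drop.
def serialize_pre(text)
  "<pre>#{text}</pre>"
end

first  = parse_pre("<pre>\n\nHello.</pre>") # "\nHello."
second = parse_pre(serialize_pre(first))    # "Hello."
first == second                             # false: the round trip lost a newline
```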
I'm inclined to follow this behavior for `#inner_html`, but it's likely surprising to clients that well-formed HTML doesn't round-trip through serialization and reparsing.
What should we do here? Should some of these follow the standard? Should we introduce a new API?