Clarify transformers documentation

Some of the behaviour of transformers is not clear from reading the documentation.

Can you specify both :transformers and :transformers_breadth in a single clean call? If so, in what are order will the transformers be called?
"[Depth-first traversal means] that markup is traversed from the deepest node upward (not from the first node to the last node)". This implies that given this: <html><body><div></div><p><a></a></p>, <a> will be the first node processed as it's the deepest. Is this true? I suspect this is instead supposed to mean that the document will be traversed in order, but children will be processed before their parents, therefore <div> would be first.
There is no definition of "breadth-first" and the example does not adequately demonstrate the concept. With the code <html><body><div><a></a></div><p><b></b></div>, which gets processed first - <a> or <p>?
"Transformers allow you to filter and modify nodes using your own custom logic, on top of (or instead of) Sanitize’s core filter. [...] Any changes made to the current node or to the document will be reflected instantly in the document and passed on to subsequently-called transformers and to Sanitize itself. " Can you add nodes? Will the added nodes be passed to other transformers and to Sanitize itself?

Great questions! I'll answer them here for you, and I'll leave this bug open until I get a chance to update the docs.

Can you specify both :transformers and :transformers_breadth in a single clean call? If so, in what are order will the transformers be called?

Yes, you can specify both :transformers and :transformers_breadth in a single clean call. Breadth transformers will be run first, then depth transformers will run after all breadth transformers have finished. The node whitelist is shared between both sets of transformers, so if a breadth transformer whitelists a node, that node will still be whitelisted when depth transformers run.

"[Depth-first traversal means] that markup is traversed from the deepest node upward (not from the first node to the last node)". This implies that given this: <html><body><div></div><p><a></a></p>, <a> will be the first node processed as it's the deepest. Is this true? I suspect this is instead supposed to mean that the document will be traversed in order, but children will be processed before their parents, therefore <div> would be first.

Yes, your suspicion is correct: the <div> would be processed first, followed by the <a>, then the <p>. Sorry for the unclear terminology there. I imagine there's a better term to describe that than simply "depth-first". I'll give it some thought.

There is no definition of "breadth-first" and the example does not adequately demonstrate the concept. With the code <html><body><div><a></a></div><p><b></b></div>, which gets processed first - <a> or <p>?

In this case, a breadth-first transformer would see <div>, then <a>, then <p>, then <b>. Note that the <html> and <body> elements would be ignored when using clean(), since it expects a fragment rather than a full document and would discard them.

"Transformers allow you to filter and modify nodes using your own custom logic, on top of (or instead of) Sanitize’s core filter. [...] Any changes made to the current node or to the document will be reflected instantly in the document and passed on to subsequently-called transformers and to Sanitize itself. " Can you add nodes? Will the added nodes be passed to other transformers and to Sanitize itself?

Yes, with the caveat that, in a depth-first transformer, if you modify a section of the tree that has already been traversed, your modification won't be seen by subsequent transformers because traversal is forward-only and doesn't backtrack to pick up changes. So it's your responsibility to either restrict your changes to the current node or not-yet-traversed nodes if you want them to be seen by subsequent transformers.

Changes made by breadth-first transformers in sections of the tree that have already been traversed will not be seen by subsequent breadth-first transformers, but will be seen by depth-first transformers, since the two sets of transformers each run their own traversal step.

Hope this helps clarify things!

Your definition of depth- and breadth-first do not match what Wikipedia says. It says a depth-first search would do the parents before their children (div, p, a in the example above). It says a breadth-first search would do all nodes on the same level one after another, starting from the top (div, p, a, b).

in a depth-first transformer, if you modify a section of the tree that has already been traversed, your modification won't be seen by subsequent transformers because traversal is forward-only and doesn't backtrack to pick up changes. So it's your responsibility to either restrict your changes to the current node or not-yet-traversed nodes if you want them to be seen by subsequent transformers.

This doesn't match what I'm seeing, or at least depends on your definition of "already been traversed".

The intention of my transformer is to replace line breaks with <br>s. I run it as a "depth-first" transformer. It processes text nodes, looks for a linebreak, modifies the text node to the text before the linebreak, then adds a sibling <br> and then a sibling text node containing the text after the linebreak. It doesn't appear that the new text node I add gets processed, so the result is that only the first linebreak gets replaced.

        fix_whitespace = lambda do |env|
            node = env[:node]
            return unless node.text?
            # remove leading whitespace if this is the first child or the previous child was a block (and so would would put us on a new line anyway)
            node.content = node.content.lstrip if node.previous_sibling.nil? or (!node.previous_sibling.description.nil? and node.previous_sibling.description.block?)
            # like above, but for trailing whitespace
            node.content = node.content.rstrip if node.next_sibling.nil? or (!node.next_sibling.description.nil? and node.next_sibling.description.block?)
            return if node.text.empty?
            return unless node.text.include?("\n")
            resulting_nodes = replace_text_with_node(node, "\n", Nokogiri::XML::Node.new('br', node.document))
        end

        config = Sanitize::Config::BASIC.merge({
            #:transformers => [linkify_urls, linkify_styles, linkify_users, yes_follow, fix_whitespace]
            :transformers => [fix_whitespace]
        })
        Sanitize.clean(text, config).html_safe

# (snip)

    def replace_text_with_node(node, text, node_to_insert)
            original_content = node.text
            start = node.text.index(text)
            # the stuff before stays in the current node
            node.content = original_content[0, start]
            # add the new node
            node.add_next_sibling(node_to_insert)
            # the stuff after becomes a new text node
            node_to_insert.add_next_sibling(Nokogiri::XML::Text.new(original_content[start + text.size, original_content.size], node.document))
            return [node.next_sibling, node.next_sibling.next_sibling]
    end

I had worked around this by recursively calling clean_node! on the new nodes at the end of my transformer

            resulting_nodes.each do |new_node|
                Sanitize.clean_node!(new_node, env[:config])
            end

But I run into a "stack level too deep" error when there's a large amount of linebreaks in the original text.

Your definition of depth- and breadth-first do not match what Wikipedia says. It says a depth-first search would do the parents before their children (div, p, a in the example above). It says a breadth-first search would do all nodes on the same level one after another, starting from the top (div, p, a, b).

I adopted the terminology from Nokogiri, on which Sanitize is based, but I agree that it's inaccurate. "Bottom up" would probably be a better way of describing the default traversal mode, and "top down" better describes the opposite.

In "bottom up" the first node traversed is the deepest child of the shallowest leftmost node, followed by its parent, etc., then the deepest child of the next leftmost shallowest node, and so on.

In "top down" the first node traversed is the shallowest leftmost node, followed by its first child, its first child, and so on, backtracking up and continuing in a top-down left-to-right order.

I'll change the terminology in a future version of Sanitize.

It doesn't appear that the new text node I add gets processed, so the result is that only the first linebreak gets replaced.

You're right; it looks like adding siblings relative to the current node doesn't result in those siblings being traversed, since Nokogiri doesn't update the traversal NodeSet with the new sibling. I'll open a new bug for this. Sorry about that!

rgrove / sanitize

Clarify transformers documentation #90