philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

When trying to add a child to a node with Floki.traverse_and_update/2, previous children of the HTML are being added recursively, and then closed all at once in the end #357

Closed. alfredbaudisch closed this issue 3 years ago

alfredbaudisch commented 3 years ago

Description

Consider the basic HTML below. For each <h1> header, I want to add an id attribute and prepend a child <a href/> anchor to it, so that I can generate a table of contents:

  <h1 name="foo">First Section<h1>
  Content content
  <h2>Inner-Title</h2>
  Content Content
  <h1>Second Section</h1>

I'm using Floki.traverse_and_update/2 and matching on {"h1", attrs, children}: I extract the inner <h1> text from children, generate an id from it, add the id to attrs, and prepend an <a/> to children.

Result

The problem is that each h1 ends up nested inside the previous one, and they are all closed at once at the end. For the example above, the output ends with three </h1> tags:

<h1 id="first-section" name="foo"><a href="#first-section" class="anchor-link"></a>First Section
<h1 id="content-content"><a href="#content-content" class="anchor-link"></a>
Content content
<h2>Inner-Title</h2>
Content Content
<h1 id="second-section"><a href="#second-section" class="anchor-link"></a>Second Section</h1></h1></h1>

Notice how the ids and anchors were added, but the headers are not closed in the correct positions. An anchor was also generated for the text line "Content content", which is not a header.

To Reproduce

Steps to reproduce the behavior:

Floki.parse_fragment!(html)
|> Floki.traverse_and_update(fn
  {"h1", attrs, children} = el ->
    case find_node_text(children) do
      nil -> el
      text ->
        id = Slug.slugify(text)
        attrs = [{"id", id} | attrs]
        anchor = {"a", [{"href", "#" <> id}, {"class", "anchor-link"}], []}
        {"h1", attrs, [anchor | children]}
    end

  el ->
    el
end)
|> Floki.raw_html()

# Find the header text
defp find_node_text([child | children]) when is_binary(child) and child != "",
  do: if(String.match?(child, ~r/[<>]+/), do: find_node_text(children), else: child)
defp find_node_text([_ | children]), do: find_node_text(children)
defp find_node_text(_), do: nil

Expected behavior

<h1 id="first-section" name="foo"><a href="#first-section" class="anchor-link"></a>First Section</h1>
Content content
<h2>Inner-Title</h2>
Content Content
<h1 id="second-section"><a href="#second-section" class="anchor-link"></a>Second Section</h1>

alfredbaudisch commented 3 years ago

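Duck debugging: while writing up the issue I noticed I wasn't closing the first header. The first line of the input is <h1 name="foo">First Section<h1>, with a second opening <h1> where the closing </h1> should be, so every header that follows ends up nested inside it.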
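
A quick way to catch this kind of mistake before transforming the tree is to check whether any h1 contains another h1. The sketch below is only an illustration: NestedHeadings is a hypothetical helper, not part of Floki, and it works on plain {tag, attrs, children} tuples like the ones matched in the callback above. The two sample trees are hand-written approximations of the nested output and the expected output shown in this issue, not actual parser output.

```elixir
defmodule NestedHeadings do
  # Hypothetical helper, not part of Floki. It only assumes the
  # {tag, attrs, children} tuple shape used in the issue's callback.

  # Returns true if any "h1" node has another "h1" somewhere below it.
  def nested_h1?(nodes) when is_list(nodes), do: Enum.any?(nodes, &nested_h1?/1)
  def nested_h1?({"h1", _attrs, children}), do: contains_h1?(children)
  def nested_h1?({_tag, _attrs, children}), do: nested_h1?(children)
  # Text nodes and anything else that is not a 3-tuple cannot contain an h1.
  def nested_h1?(_other), do: false

  defp contains_h1?(nodes) when is_list(nodes), do: Enum.any?(nodes, &contains_h1?/1)
  defp contains_h1?({"h1", _attrs, _children}), do: true
  defp contains_h1?({_tag, _attrs, children}), do: contains_h1?(children)
  defp contains_h1?(_other), do: false
end

# Assumed shape of the tree when the first <h1> is never closed:
# each later h1 sits inside the previous one.
malformed = [
  {"h1", [{"name", "foo"}],
   [
     "First Section",
     {"h1", [],
      [
        "\nContent content\n",
        {"h2", [], ["Inner-Title"]},
        "\nContent Content\n",
        {"h1", [], ["Second Section"]}
      ]}
   ]}
]

# Assumed shape of the tree once the first header is closed with </h1>.
well_formed = [
  {"h1", [{"name", "foo"}], ["First Section"]},
  "\nContent content\n",
  {"h2", [], ["Inner-Title"]},
  "\nContent Content\n",
  {"h1", [], ["Second Section"]}
]

IO.inspect(NestedHeadings.nested_h1?(malformed))
IO.inspect(NestedHeadings.nested_h1?(well_formed))
```

Running a check like this on the parsed tree before calling Floki.traverse_and_update/2 would have pointed at the input markup rather than at the traversal.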