philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Floki.traverse_and_update with an accumulator #246

Closed Dalgona closed 4 years ago

Dalgona commented 4 years ago

Feature goal

It would be nice if we had Floki.traverse_and_update/3, whose signature looks like:

def traverse_and_update(html_tree, acc, fun)

...where acc is the initial value for an accumulator, and fun is a 2-ary function which takes an HTML node and an accumulator, and returns a 2-tuple which contains a new HTML node and a new accumulator.

With this function, we could have a functional approach to traversing and manipulating HTML trees with a state.
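To make the idea concrete, here is a minimal sketch of how such a function might behave. It is not Floki's implementation: TreeWalk is a hypothetical stand-in module, it works on plain Floki-shaped trees ({tag, attrs, children} tuples and text binaries), and it assumes children are visited before their parent and that text nodes are also passed to fun.

```elixir
defmodule TreeWalk do
  # Hypothetical stand-in for the proposed Floki.traverse_and_update/3.
  # Assumption: children are visited before their parent, in document order.

  # A list of nodes: thread the accumulator through each node in turn.
  def traverse_and_update(nodes, acc, fun) when is_list(nodes) do
    Enum.map_reduce(nodes, acc, &traverse_and_update(&1, &2, fun))
  end

  # An element: update its children first, then hand the element to fun.
  def traverse_and_update({tag, attrs, children}, acc, fun) do
    {new_children, new_acc} = traverse_and_update(children, acc, fun)
    fun.({tag, attrs, new_children}, new_acc)
  end

  # Anything else (text, comments, ...) goes to fun unchanged.
  def traverse_and_update(other, acc, fun), do: fun.(other, acc)
end

tree = [
  {"h1", [], ["Hello, world!"]},
  {"p", [], ["Lorem ipsum ", {"strong", [], ["dolor"]}, " sit amet"]}
]

{same_tree, count} =
  TreeWalk.traverse_and_update(tree, 0, fn
    {name, _attrs, _nodes} = elem, acc when is_binary(name) -> {elem, acc + 1}
    node, acc -> {node, acc}
  end)

IO.inspect({same_tree == tree, count})
# ==> {true, 3}
```

Because fun returns both the node and the accumulator, the same traversal can rewrite the tree and collect state in one pass.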

Examples

# Example 1: Counting the number of all HTML elements

{:ok, html_tree} =
  Floki.parse_document("""
    <h1>Hello, world!</h1>
    <p>Lorem ipsum <strong>dolor</strong> sit amet</p>
    """)
# ==> {:ok, ["..."]}

{new_tree, count} =
  Floki.traverse_and_update(html_tree, 0, fn
    elem = {name, _attr, _nodes}, acc when is_binary(name) ->
      {elem, acc + 1}

    node, acc ->
      {node, acc}
  end)
# ==> {["..."], 3}

# Example 2: Assigning unique numbers to all <h2> tags

{:ok, html_tree} =
  Floki.parse_document("""
  <h1>The quick</h1>
  <h2>Brown fox</h2>
  <h3>Jumps over</h3>
  <h2>The lazy</h2>
  <h2>Dog</h2>
  """)
# ==> {:ok, [
#   {"h1", [], ["The quick"]},
#   {"h2", [], ["Brown fox"]},
#   {"h3", [], ["Jumps over"]},
#   {"h2", [], ["The lazy"]},
#   {"h2", [], ["Dog"]}]}

{new_tree, _acc} =
  Floki.traverse_and_update(html_tree, 0, fn
    {"h2", attr, nodes}, acc ->
      {{"h2", [{"data-count", to_string(acc)} | attr], nodes}, acc + 1}

    node, acc ->
      {node, acc}
  end)
# ==> {[
#   {"h1", [], ["The quick"]},
#   {"h2", [{"data-count", "0"}], ["Brown fox"]},
#   {"h3", [], ["Jumps over"]},
#   {"h2", [{"data-count", "1"}], ["The lazy"]},
#   {"h2", [{"data-count", "2"}], ["Dog"]}], 3}

I would like to submit a pull request if you think this feature is good for this project.

philss commented 4 years ago

Hi, @Dalgona! Thank you for this suggestion! I think it's a good idea to add this new function :+1: Please go ahead and let me know if you need help.

RichMorin commented 4 years ago

In a stunning example of synchronicity, I just started writing some code that needs exactly what this pull request provides. In fact, I plan to do something very similar to @Dalgona's Example 2.

Basically, I want to create a table of contents for each outgoing HTML page. To do this, I need to traverse the tree and wrap an "<a name=...>" element around each "<h* ...>" element. The name needs to be unique and based on the structure of the page's sections (e.g., 1, 1_1, 1_1_1). The accumulator stores these names, as well as the header text. After the function returns, I can use the accumulator to create the table of contents.
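A rough sketch of that numbering scheme, under several assumptions: TocSketch and all of its function names are hypothetical, the private walk/3 stands in for the proposed traverse_and_update/3 so the example runs without Floki, and the wrapper is assumed to be an anchor element carrying a name attribute. The accumulator holds the per-level counters plus the collected {name, text} entries.

```elixir
defmodule TocSketch do
  # Hypothetical module; walk/3 stands in for the proposed traverse_and_update/3.

  # Returns {tree_with_anchors, [{section_name, heading_text}, ...]}.
  def number_headings(tree) do
    {new_tree, {_counters, toc}} = walk(tree, {[], []}, &step/2)
    {new_tree, Enum.reverse(toc)}
  end

  # Headings: bump the section counters, wrap the heading, record a TOC entry.
  defp step({tag, _attrs, children} = heading, {counters, toc})
       when tag in ["h1", "h2", "h3", "h4", "h5", "h6"] do
    level = tag |> String.trim_leading("h") |> String.to_integer()
    counters = bump(counters, level)
    name = Enum.join(counters, "_")
    # Assumption: the wrapper is an <a name="..."> anchor.
    anchor = {"a", [{"name", name}], [heading]}
    {anchor, {counters, [{name, text(children)} | toc]}}
  end

  defp step(node, acc), do: {node, acc}

  # Keep counters for the outer levels, drop deeper ones, increment this level.
  # A skipped level (h1 followed directly by h3) shows up as 0, e.g. "1_0_1".
  defp bump(counters, level) do
    (counters ++ List.duplicate(0, level))
    |> Enum.take(level)
    |> List.update_at(level - 1, &(&1 + 1))
  end

  # Concatenate all text inside a node list.
  defp text(nodes) when is_list(nodes), do: Enum.map_join(nodes, &text/1)
  defp text({_tag, _attrs, children}), do: text(children)
  defp text(bin) when is_binary(bin), do: bin
  defp text(_other), do: ""

  # Children first, then the node itself, threading the accumulator through.
  defp walk(nodes, acc, fun) when is_list(nodes) do
    Enum.map_reduce(nodes, acc, &walk(&1, &2, fun))
  end

  defp walk({tag, attrs, children}, acc, fun) do
    {children, acc} = walk(children, acc, fun)
    fun.({tag, attrs, children}, acc)
  end

  defp walk(other, acc, _fun), do: {other, acc}
end

page = [
  {"h1", [], ["Intro"]},
  {"h2", [], ["Setup"]},
  {"h3", [], ["Details"]},
  {"h2", [], ["Usage"]},
  {"h1", [], ["Appendix"]}
]

{anchored, toc} = TocSketch.number_headings(page)
IO.inspect(hd(anchored))
# ==> {"a", [{"name", "1"}], [{"h1", [], ["Intro"]}]}
IO.inspect(toc)
# ==> [{"1", "Intro"}, {"1_1", "Setup"}, {"1_1_1", "Details"}, {"1_2", "Usage"}, {"2", "Appendix"}]
```

The returned list is already in document order, so rendering the table of contents is a plain map over it.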

Anyway, I plan to try using @Dalgona's code now and update to the next version of Floki when the new feature has been added. Cool...

-r

Dalgona commented 4 years ago

What a coincidence... 😆

I also had to implement a table of contents plugin for my static website generator project. If you have any difficulty, feel free to take a look at my code for some inspiration.

https://github.com/Dalgona/Serum/blob/v1/master/lib/serum/plugins/table_of_contents.ex

RichMorin commented 4 years ago

Thanks! I looked at your code, but I think my use case differs enough that there will only be slight areas of overlap. That said, I seem to be making good progress in using your new version of traverse_and_update; I'll post a link to my code if and when it seems to be working.

RichMorin commented 4 years ago

I got my TOC code working reasonably well (e.g., Crash Scene Field Reference). The code, though still a bit rough, is available in router_toc.ex.

Anyway, thanks to all for their efforts!