philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

How to filter out empty paragraphs using Floki #160

Closed: newterminator closed this issue 6 years ago

newterminator commented 6 years ago

I have spent the last couple of hours trying different combinations to get rid of empty paragraphs in an HTML input string. The string is "<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"

The final string I am trying to extract is "<p>hello</p><p>now is the <b>time</b></p>"

The closest I got was with Floki.filter_out(str, "br") |> Floki.raw_html

I would appreciate any help on this.

philss commented 6 years ago

Hi @newterminator! I'm trying to think of a way to solve this.

One way would be to remove the empty nodes with a new pseudo-class selector, :empty (it is not implemented yet). But according to the spec, :empty only matches elements that have no children at all, so it would not remove paragraphs that contain line breaks, such as <p><br></p>.

I'm trying to think about selectors that could work in combination with the ":not" pseudo-class selector too. Seems that we don't have existing CSS selectors for that, which would be an opportunity to introduce a new one.
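Since no existing CSS selector covers this case, one alternative is to skip selectors and filter the parsed tree in plain Elixir. The following is only a minimal sketch, not a Floki feature: the module name EmptyParagraphs and its strip/1 function are hypothetical, it assumes a recent Floki release in which Floki.parse_fragment/1 returns {:ok, tree}, and it only inspects top-level paragraphs.

```elixir
# Assumption: run as a script, so Floki is fetched with Mix.install/1.
# In a Mix project, add :floki to deps instead.
Mix.install([:floki])

defmodule EmptyParagraphs do
  # Hypothetical helper: drops top-level <p> elements that contain
  # no text other than whitespace, then renders the rest back to HTML.
  def strip(html) do
    # Assumes a Floki version where parse_fragment/1 returns {:ok, tree}.
    {:ok, nodes} = Floki.parse_fragment(html)

    nodes
    |> Enum.reject(&blank_paragraph?/1)
    |> Floki.raw_html()
  end

  # A paragraph is blank when the text of all its descendants,
  # trimmed of whitespace, is the empty string.
  defp blank_paragraph?({"p", _attrs, children}) do
    children |> Floki.text() |> String.trim() == ""
  end

  # Text nodes, comments, and other elements are left untouched.
  defp blank_paragraph?(_node), do: false
end

html = "<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
IO.puts(EmptyParagraphs.strip(html))
```

Paragraphs nested inside other elements would need a recursive walk over the tree, which this sketch leaves out.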

newterminator commented 6 years ago

Hi @philss, thank you for your input. I was aware of the :empty selector. I will look around in JavaScript to see if the frontend can deal with this, but if anyone comes up with a solution on the Elixir side, that would be better, since I could confirm the required output string before saving it to the database. Merry Christmas @philss

mischov commented 6 years ago

Here is a possible solution using Meeseeks, which allows custom selectors for this type of situation.

Create a custom selector

In this case, the selector matches elements that have a text node as a descendant, but one could go further and make sure that the text node contains characters other than whitespace.

defmodule ElementContainingText do
  use Meeseeks.Selector

  alias Meeseeks.Document

  defstruct tag: nil

  # No tag defined, matches any element that has a text node for a descendant
  def match(%ElementContainingText{tag: nil}, %Document.Element{} = element, document, _context) do
    element_contains_text?(element, document)
  end

  # Tag defined, matches elements with that tag that have a text node for a descendant
  def match(%ElementContainingText{tag: target}, %Document.Element{tag: tag} = element, document, _context) when target == tag do
    element_contains_text?(element, document)
  end

  def match(_selector, _node, _document, _context) do
    false
  end

  defp element_contains_text?(element, document) do
    descendants = Document.descendants(document, element.id)
    descendant_nodes = Document.get_nodes(document, descendants)

    Enum.any?(descendant_nodes, &text_node?/1)
  end

  defp text_node?(%Document.Text{}), do: true
  defp text_node?(_), do: false
end
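The stricter variant mentioned above, which ignores text nodes that contain only whitespace, might look like the sketch below. It is an illustration rather than part of Meeseeks: the module name ElementContainingVisibleText is hypothetical, and it assumes that Meeseeks.Document.Text keeps its string in a content field.

```elixir
# Assumption: run as a script, so Meeseeks is fetched with Mix.install/1.
# In a Mix project, add :meeseeks to deps instead.
Mix.install([:meeseeks])

defmodule ElementContainingVisibleText do
  use Meeseeks.Selector

  alias Meeseeks.Document

  defstruct tag: nil

  # Matches any element when tag is nil, otherwise only elements with that tag,
  # provided some descendant text node has non-whitespace characters.
  def match(%ElementContainingVisibleText{tag: target}, %Document.Element{tag: tag} = element, document, _context)
      when is_nil(target) or target == tag do
    descendants = Document.descendants(document, element.id)

    document
    |> Document.get_nodes(descendants)
    |> Enum.any?(&visible_text_node?/1)
  end

  def match(_selector, _node, _document, _context) do
    false
  end

  # Assumes the text of a Document.Text node lives in its content field.
  defp visible_text_node?(%Document.Text{content: content}), do: String.trim(content) != ""
  defp visible_text_node?(_), do: false
end

html = "<p> </p><p>hello</p><p><br></p>"

html
|> Meeseeks.all(%ElementContainingVisibleText{tag: "p"})
|> Enum.map(&Meeseeks.html/1)
|> Enum.join()
|> IO.puts()
```

With this variant, a paragraph holding only spaces or newlines is treated the same way as one holding only a line break.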

Use the custom selector

iex(1)> html = "<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
"<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"

iex(2)> selector = %ElementContainingText{tag: "p"}
%ElementContainingText{tag: "p"}

iex(3)> Meeseeks.all(html, selector)
[#Meeseeks.Result<{ <p>hello</p> }>,
 #Meeseeks.Result<{ <p>now is the <b>time</b></p> }>]

iex(4)> Meeseeks.all(html, selector) |> Enum.map(&Meeseeks.html/1) |> Enum.join()
"<p>hello</p><p>now is the <b>time</b></p>"

The example selector might need tweaking for your use case, but custom selectors are nice tools for situations when a selection doesn't fall within the realm of CSS or XPath.

newterminator commented 6 years ago

Hey @mischov, thanks so much for the detailed writeup. Your solution using your Meeseeks library solves it correctly. I just tested other strings that come in via the input and they all worked. I am going to read up more on the Meeseeks documentation as well.

Thanks for the Christmas gift of your library and your help on resolving the issue.