Closed newterminator closed 6 years ago
Hi @newterminator! I'm trying to think of a way to solve this.
One way would be to remove the empty nodes with a new pseudo-class selector, :empty (it is not implemented yet). But according to the spec, this would not remove nodes that contain line breaks (<br> elements), because :empty only matches elements with no children at all.
I'm also trying to think of selectors that could work in combination with the ":not" pseudo-class selector. It seems we don't have an existing CSS selector for that, which would be an opportunity to introduce a new one.
Hi @philss, thank you for your input. I was aware of the :empty selector. I will look around in JavaScript to see if the frontend can deal with this, but if anyone comes up with a solution on the Elixir side, that would be better, since I could confirm the required output string before saving it to the database.
Merry Christmas @philss
Here is a possible solution using Meeseeks, which allows custom selectors for this type of situation.
In this case, the selector matches elements that have a text node as a descendant, but one could go further and make sure that the text node contains characters other than whitespace.
defmodule ElementContainingText do
  use Meeseeks.Selector

  alias Meeseeks.Document

  defstruct tag: nil

  # No tag defined, matches any element that has a text node for a descendant
  def match(%ElementContainingText{tag: nil}, %Document.Element{} = element, document, _context) do
    element_contains_text?(element, document)
  end

  # Tag defined, matches elements with that tag that have a text node for a descendant
  def match(%ElementContainingText{tag: target}, %Document.Element{tag: tag} = element, document, _context)
      when target == tag do
    element_contains_text?(element, document)
  end

  def match(_selector, _node, _document, _context) do
    false
  end

  defp element_contains_text?(element, document) do
    descendants = Document.descendants(document, element.id)
    descendant_nodes = Document.get_nodes(document, descendants)
    Enum.any?(descendant_nodes, &text_node?/1)
  end

  defp text_node?(%Document.Text{}), do: true
  defp text_node?(_), do: false
end
iex(1)> html = "<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
"<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
iex(2)> selector = %ElementContainingText{tag: "p"}
%ElementContainingText{tag: "p"}
iex(3)> Meeseeks.all(html, selector)
[#Meeseeks.Result<{ <p>hello</p> }>,
#Meeseeks.Result<{ <p>now is the <b>time</b></p> }>]
iex(4)> Meeseeks.all(html, selector) |> Enum.map(&Meeseeks.html/1) |> Enum.join()
"<p>hello</p><p>now is the <b>time</b></p>"
The example selector might need tweaking for your use case, but custom selectors are nice tools for situations when a selection doesn't fall within the realm of CSS or XPath.
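One such tweak is the whitespace check mentioned earlier in this comment. The following is only a minimal sketch: TextCheck and non_blank?/1 are hypothetical names that do not come from Meeseeks, and it assumes the text node struct exposes its string under a content field.

```elixir
defmodule TextCheck do
  # Hypothetical helper (not part of Meeseeks): returns true when the
  # string has at least one character left after trimming whitespace.
  def non_blank?(content) when is_binary(content) do
    String.trim(content) != ""
  end
end

# Assumed wiring inside ElementContainingText (the content field is an
# assumption; check the Meeseeks.Document.Text struct before relying on it):
#
#   defp text_node?(%Document.Text{content: content}), do: TextCheck.non_blank?(content)
#   defp text_node?(_), do: false
```

With a check like this, a paragraph whose only text is a space or a newline would no longer count as containing text.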
Hey @mischov, thanks so much for the detailed writeup. Your solution using your Meeseeks library solves it correctly. I just tested other strings that come in via the input and they all worked. I am going to read up more on the Meeseeks documentation as well.
Thanks for the Christmas gift of your library and your help on resolving the issue.
I have spent the last couple of hours trying different combinations to get rid of empty paragraphs in an HTML string input. The str is
"<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
The final string I am trying to extract is
"<p>hello</p><p>now is the <b>time</b></p>"
The closest I got was with
Floki.filter_out(str, "br") |> Floki.raw_html
I would appreciate any help on this.
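For comparison, here is a rough sketch of one way this might be done with Floki alone, by filtering the parsed top-level nodes rather than using a selector. It is not a solution proposed in this thread. The EmptyParagraphs module and its function names are hypothetical, and the sketch assumes a Floki version that provides Floki.parse_fragment/1.

```elixir
# Only needed when running this as a standalone script;
# inside a Mix project, list :floki in mix.exs instead.
Mix.install([{:floki, "~> 0.30"}])

defmodule EmptyParagraphs do
  # Hypothetical helper: drops top-level <p> elements that contain no
  # visible text, then serializes the remaining nodes back to a string.
  def strip(html) do
    {:ok, nodes} = Floki.parse_fragment(html)

    nodes
    |> Enum.reject(&blank_paragraph?/1)
    |> Floki.raw_html()
  end

  # A paragraph counts as blank when its combined text is empty
  # or consists only of whitespace.
  defp blank_paragraph?({"p", _attrs, _children} = node) do
    node |> Floki.text() |> String.trim() == ""
  end

  defp blank_paragraph?(_other), do: false
end

str = "<p><br></p><p>hello</p><p>now is the <b>time</b></p><p><b><br></b></p>"
IO.puts(EmptyParagraphs.strip(str))
```

Note that this only looks at top-level paragraphs, and it would also drop a paragraph that holds only non-text content such as an image, so it may need adjusting for other inputs.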