rrrene / html_sanitize_ex

HTML sanitizer for Elixir
MIT License

Whitespace for images #52

Closed sb8244 closed 1 year ago

sb8244 commented 3 years ago

I found a tricky issue when stripping HTML to plain text. If an image separates two paragraphs, their text is joined with no space in between. For example:

iex(3)> HtmlSanitizeEx.strip_tags("<p>I'm a paragraph.</p><img src='xx' /><p>I'm another.</p>")
"I'm a paragraph.I'm another."

I would expect a space between them. Is something like this the best approach?

iex(4)> "<p>I'm a paragraph.</p><img src='xx' /><p>I'm another.</p>" |> String.replace("<img", " <img") |> HtmlSanitizeEx.strip_tags()
"I'm a paragraph. I'm another."
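
The same idea can be generalized beyond `<img`. What follows is only a minimal sketch, not something the library provides: the `PlainText` module, its function names, and the list of tags are all hypothetical choices made for illustration. The only library call used is `HtmlSanitizeEx.strip_tags/1`, as shown above. The sketch pads a space in front of selected opening tags, strips the tags, and then collapses the extra whitespace.

```elixir
defmodule PlainText do
  # Hypothetical helper, not part of html_sanitize_ex.
  # Opening tags that should act as a word boundary in the plain-text output.
  # The tag list is an assumption; adjust it to the markup you expect.
  @boundary_tags ~r/<(p|div|img|br|li|tr|h[1-6])\b/i

  # Insert a space before each matching opening tag, e.g. "<img" becomes " <img".
  # Closing tags such as "</p>" are left untouched.
  def space_out_tags(html) do
    Regex.replace(@boundary_tags, html, " <\\1")
  end

  # Collapse runs of whitespace into single spaces and trim both ends.
  def collapse_whitespace(text) do
    text
    |> String.replace(~r/\s+/, " ")
    |> String.trim()
  end

  # Full pipeline: pad tags, strip them with the library, tidy the whitespace.
  def to_plaintext(html) do
    html
    |> space_out_tags()
    |> HtmlSanitizeEx.strip_tags()
    |> collapse_whitespace()
  end
end
```

Calling `PlainText.to_plaintext/1` on the example string should put a space where the image was, assuming the parser keeps the inserted whitespace as it does in the `String.replace` call above. A regex-based pass like this does not understand HTML, so text that happens to contain something like `<p` inside an attribute value or a comment would also be padded.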

This use case (HTML to plain text) might be fraught with pitfalls that exist no matter what and that I'm just not seeing yet.

rrrene commented 1 year ago

I apologize for the age/inactivity on this issue. I should have done a better job at resolving this properly. 😥

Unfortunately, this particular issue is caused by the underlying HTML parser and cannot be solved within this library.

Please feel free to re-open this issue at your discretion.