philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.05k stars 155 forks source link

raw_html encodes HTML where it should not #161

Closed grych closed 6 years ago

grych commented 6 years ago

158 fixed the issue with re-encoding HTML entities, but introduced the new issue I am facing in my project: it encodes them where it shouldn't:

Example:

iex(1)> Floki.parse("<script>console.log(\"text\");</script>") |> Floki.raw_html  
"<script>console.log(&quot;text&quot;);</script>"

Expected (this is what we can observer in 0.18):

iex(2)> Floki.parse("<script>console.log(\"text\");</script>") |> Floki.raw_html  
"<script>console.log(\"text\");</script>"
philss commented 6 years ago

Hi, @grych! Thank you for the report, and sorry for the delay.

This is definitely a bug. I think we shouldn't encode/decode strings that are inside <script> tags. Avoiding that should fix the problem.

I will take a look on this soon.

grych commented 6 years ago

Thanks! BTW it is not only an issue with <script>.

iex(5)> Floki.parse("<a>\"in a\"</a>") |> Floki.raw_html 
"<a>&quot;in a&quot;</a>"

Shouldn't raw_html/1 preserve the text in the tag?

davydog187 commented 6 years ago

@philss I think this bug is more pervasive than what you're suggesting. This behavior is both inconsistent as well as incorrect.

Inline <style> tags also get their inner css styles escaped, which can break css rules. For example

iex(5)> text = "<style>.myclass[data-attr=\"stuff\"] { width: 100%; }</style>"
"<style>.myclass[data-attr=\"stuff\"] { width: 100%; }</style>"
iex(6)> Floki.parse(text) |> Floki.raw_html
"<style>.myclass[data-attr=&quot;stuff&quot;] { width: 100%; }</style>"

The other problem I see is that the intention of #159 was to keep input/output consistent, but its still breaking it when you don't want values escaped.

davydog187 commented 6 years ago

Can I propose we add second argument to raw_html/1 that allows us to configure whether we escape or not?

Related, right now Floki exposes raw_html/2, which is an implementation detail, and should not be exposed to the user. @philss @grych if you agree with this approach I'd be happy to put in a PR

davydog187 commented 6 years ago

Another alternative is to not escape the insides of <style> or <script> as you mentioned, but I fear that there may be more use cases than what we're currently considering

grych commented 6 years ago

Cool, configurable escape would solve my issue.

philss commented 6 years ago

Hi @davydog187! Thank you for the input.

I did the later suggestion, which is to ignore the <script> and <style> tags. I think this is not enough, and like you mentioned, may not cover all cases. But fix the major cases.

Can I propose we add second argument to raw_html/1 that allows us to configure whether we escape or not? Related, right now Floki exposes raw_html/2, which is an implementation detail, and should not be exposed to the user. @philss @grych if you agree with this approach I'd be happy to put in a PR

I think it would be great. Go for it, please.

davydog187 commented 6 years ago

Thanks @philss, I'll put in a PR after #164 is merged

philss commented 6 years ago

@davydog187 @grych #164 is merged and published now. Thanks!

philss commented 6 years ago

I will let this issue open until the optional encode for raw_html.

davydog187 commented 6 years ago

Thanks I’ll take a stab tomorrow! On Thu, Jan 25, 2018 at 6:57 PM Philip Sampaio notifications@github.com wrote:

I will let this issue open until the optional encode for raw_html.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/philss/floki/issues/161#issuecomment-360640702, or mute the thread https://github.com/notifications/unsubscribe-auth/AA-PSTm-GtSah4ff6HD5li8VJHp_smehks5tORTigaJpZM4RXqwy .

davydog187 commented 6 years ago

@philss this should be closeable now. @grych can you verify that this addresses all of your concerns?

I think we need a new Floki release

grych commented 6 years ago

Thanks guys! config :floki, :encode_raw_html, false did the trick. After the release I'll update Drab's dependencies.

philss commented 6 years ago

Hi @grych, @davydog187 . The feature was released under version 0.20.0. Thanks!

jtomaszewski commented 6 years ago

Can I add a note in Changelog that it's a breaking change? This broke our app, after updating floki from 0.18 to 0.20.1 .

I remember I looked at the floki's changelog, but I thought all the changes were minor. We needed to add config :floki, :encode_raw_html, false config to bring back old behaviour.

philss commented 6 years ago

Hi, @jtomaszewski ! Sorry for the delay. I didn't see the message. Please go ahead. Thanks!