redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/
8.14k stars 840 forks source link

strip_html() bloblang function is encoding HTML entities #662

Open davidshrive opened 3 years ago

davidshrive commented 3 years ago

Hey Ash,

The strip_html() bloblang function seems to be encoding html entities which was unexpected. For example running the following json document:

{
  "body": {
    "content": {
      "text": "Dublin, Feb.  08, 2021  (GLOBE NEWSWIRE) -- The \"The Global Market for Metamaterials 2021\" report has been added to ResearchAndMarkets.coms offering.\nMetamaterials applications will represent a multi-billion market within the next decade with product advances in radar and lidar for autonomous vehicles, telecommunications antenna, 5G networks, coatings, vibration damping, wireless charging, noise prevention and more."
    }
  }
}

Through this bloblang...

body.content.text = body.content.text.strip_html()

Returns...

{
  "body": {
    "content": {
      "text": "Dublin, Feb.  08, 2021  (GLOBE NEWSWIRE) -- The "The Global Market for Metamaterials 2021" report has been added to ResearchAndMarkets.coms offering.\nMetamaterials applications will represent a multi-billion market within the next decade with product advances in radar and lidar for autonomous vehicles, telecommunications antenna, 5G networks, coatings, vibration damping, wireless charging, noise prevention and more."
    }
  }
}

Where you can see the quotes (") have been replaced with their encoded values (&#34)

I've managed to replicate this issue in the latest benthos (3.40.0).

Do you agree this is a bug or is this the intended behaviour?

Thanks

Jeffail commented 3 years ago

Hey @davidshrive, this is an unfortunate consequence of the underlying library being intended primarily for sanitizing text to insert within HTML. I think given this method has use cases beyond sanitation then it's unideal to perform entity escapes as you can't be certain that this.foo.strip_html().unescape_html() is equivalent to the original text, but in the case where you want sanitized output you'd be able to do this.foo.strip_html().escape_html().

The change we'd need to the lib used is very simple, add a policy parameter that's referenced here: https://github.com/microcosm-cc/bluemonday/blob/master/sanitize.go#L363 in order to disable the escapes. I'm not sure if the maintainers would be interested in having that addition merged upstream but we can ask. The other more serious problem though is that existing users might be relying on the current behaviour of strip_html(), so we'd either need to add another optional parameter, add a new method with a different name (yuck), or wait for Benthos v4.

davidshrive commented 3 years ago

Thanks @Jeffail

text.strip_html().unescape_html() does indeed solve the problem.

Optional param seems like a sensible solution to me but low priority now I know the above sorts it 👍