simonw / strip-tags

CLI tool for stripping tags from HTML
Apache License 2.0

Strip content of script tags #4

Closed: simonw closed this issue 1 year ago

simonw commented 1 year ago

The contents of script tags aren't visible on the page, so they should be removed as well.

simonw commented 1 year ago

https://www.w3.org/TR/2011/WD-html5-20110405/rendering.html#display-types suggests more tags that should have their content removed:

[hidden], area, base, basefont, command, datalist, head,
input[type=hidden], link, menu[type=context], meta, noembed, noframes,
param, rp, script, source, style, track, title { /* [case-insensitive](https://www.w3.org/TR/2011/WD-html5-20110405/rendering.html#case-insensitive-selector-exception) */
  display: none;
}
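
A rough sketch of what dropping that hidden content could look like, using only Python's standard-library `html.parser` (not necessarily how strip-tags does it). The names `CONTENT_HIDDEN`, `VisibleTextExtractor` and `visible_text` are made up for illustration. It only covers the plain tag names from the rule above that can hold text; the void elements (meta, link, input and so on) have no text to drop, and the attribute-based selectors such as `[hidden]` are left out to keep it short.

```python
from html.parser import HTMLParser

# Hypothetical subset of the quoted rule: elements that can contain text.
# Void elements and attribute selectors ([hidden], menu[type=context])
# are deliberately ignored in this sketch.
CONTENT_HIDDEN = {
    "datalist", "head", "noembed", "noframes",
    "rp", "script", "style", "title",
}


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping anything nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_HIDDEN and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Calling `visible_text(page_html)` on a full page would then return the body text with the head, scripts and styles left out.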

simonw commented 1 year ago

Might be neat to swap images for their alt= text too.
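
One possible shape for that, again sketched with the standard-library `html.parser` rather than anything strip-tags actually ships. `AltTextExtractor` and `text_with_alt` are hypothetical names, and images without an alt attribute are simply skipped.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collects text and substitutes each <img> with its alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; value may be None.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        self.parts.append(data)


def text_with_alt(html):
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

For example, `text_with_alt('<p>Look: <img src="cat.png" alt="a cat"> sleeping</p>')` gives `Look: a cat sleeping`.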

simonw commented 1 year ago

Idea: if you specifically target one of these tags as a selector (e.g. strip-tags script), then its content isn't removed.
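
To make the idea concrete, here is a tiny sketch of the selection logic, assuming the selectors are plain tag names. `CONTENT_HIDDEN` and `tags_to_empty` are hypothetical names, not part of strip-tags.

```python
# Hypothetical set of tags whose content would normally be dropped.
CONTENT_HIDDEN = frozenset({"script", "style"})


def tags_to_empty(selectors):
    """Return the hidden tags whose content should still be dropped.

    Any hidden tag that the user explicitly named as a selector is
    exempted, so its content would be kept.
    """
    requested = {s.strip().lower() for s in selectors}
    return CONTENT_HIDDEN - requested
```

With this, `tags_to_empty(["script"])` leaves only `style` to be emptied, while `tags_to_empty([])` keeps the default behaviour for both tags.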

simonw commented 1 year ago

Just spotted this: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are generally not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

Given that, I'm going to drop the idea of keeping script and style content when those tags are listed in the selectors.