Closed simonw closed 1 year ago
https://www.w3.org/TR/2011/WD-html5-20110405/rendering.html#display-types suggests more tags that should have their content removed:
[hidden], area, base, basefont, command, datalist, head,
input[type=hidden], link, menu[type=context], meta, noembed, noframes,
param, rp, script, source, style, track, title { /* [case-insensitive](https://www.w3.org/TR/2011/WD-html5-20110405/rendering.html#case-insensitive-selector-exception) */
display: none;
}
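A rough sketch of what removing that content could look like, using only Python's standard-library `html.parser` rather than any particular parser this tool depends on. The names `VisibleTextExtractor` and `visible_text` are made up for illustration, and it assumes elements are closed with explicit end tags (an unclosed `<head>` would hide everything after it).

```python
from html.parser import HTMLParser

# Elements from the rule above that can have content. The void elements in
# the rule (area, base, input, link, meta, ...) have no content to remove.
HIDDEN_TAGS = {
    "datalist", "head", "noembed", "noframes",
    "rp", "script", "style", "title",
}
VOID_TAGS = {
    "area", "base", "basefont", "br", "col", "command", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collect text that is not inside a hidden element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = []  # stack of (tag, is_hidden) pairs
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # never gets an end tag, so keep it off the stack
        attr_map = dict(attrs)
        is_hidden = (
            tag in HIDDEN_TAGS
            or "hidden" in attr_map
            or (tag == "menu" and (attr_map.get("type") or "").lower() == "context")
        )
        self.open_elements.append((tag, is_hidden))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open element, if there is one.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break

    def handle_data(self, data):
        # Keep text only when no enclosing element is hidden.
        if not any(is_hidden for _, is_hidden in self.open_elements):
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(parser.chunks).split())
```

The stack is there so that text nested anywhere inside a hidden element is dropped, not just its direct children.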
Might be neat to swap images for their `alt=` text too.
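One way the image swap could work, sketched with the standard-library `html.parser`. `AltTextExtractor` and `text_with_alt` are hypothetical names, and dropping images that have a missing or empty `alt` is an assumption, not something decided here.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Hypothetical helper: extract text, using alt text in place of images."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:  # skip images with a missing or empty alt attribute
                self.chunks.append(alt)

    def handle_data(self, data):
        self.chunks.append(data)


def text_with_alt(html):
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the alt text reads inline.
    return " ".join("".join(parser.chunks).split())
```

For example, `text_with_alt('<p>Photo: <img src="cat.jpg" alt="A cat"> today</p>')` returns `"Photo: A cat today"`.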
Idea: if you specifically target one of these tags as a selector, e.g. `strip-tags script`, then it doesn't have its content removed.
Just spotted this: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of `<script>`, `<style>`, and `<template>` tags are generally not considered to be ‘text’, since those tags are not part of the human-visible content of the page.
I'm going to drop that idea about `script` and `style` being allowed if they were listed in the selectors then.
These aren't visible on the page so they should be removed as well.