leizongmin / js-xss

Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist
http://jsxss.com

Preserve text content (document data) for ignored tags (removing all child tags). #203

Open josundt opened 4 years ago

josundt commented 4 years ago

Currently, I can't find a way to preserve the inner text content of non-whitelisted (ignored) tags.

I am using xss to sanitize HTML content in a browser application, for example when the user pastes HTML code into a rich text editor component.

It is crucial that an HTML sanitizer preserves the document's data (text content), while removing the unwanted markup code (keeping only allowed tags, attributes and styles).

For non-whitelisted tags, there should be some way to get the plain text content between start and end tags (similar to DOM HTMLElement's innerText property in browser).

Example:

<table>
  <tbody>
    <tr>
      <td>Cell 1</td>
      <td>Cell 2</td>
    </tr>
  </tbody>
</table>

To illustrate my point (with a somewhat obscure example), let's say table is not in the whitelist while all the other tags from the example above are whitelisted. It makes little sense to preserve the child elements without the parent table element: the result would be invalid HTML that would not render properly. You may say that my example whitelist is bad, but it is still important to preserve the document "data" (the cell contents), even if all formatting is lost. This is not possible with xss today.

In my opinion, the most consistent way to sanitize non-whitelisted tags and guarantee that the sanitized output is valid HTML is to remove the element's start and end tags and to strip all HTML markup between them, keeping only the text.

To illustrate what I mean, I show an example that you can run in a browser developer tools console:

var htmlContent = "<table><tbody><tr><td>Cell1</td><td>Cell2</td></tr></tbody></table>"; // Same as HTML above
var div = document.createElement("div");
div.innerHTML = htmlContent;
console.log(div.innerText);
// => "Cell1Cell2"

This is close to what I want, but it would be more correct if one or more space characters separated the text content of child elements.
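To make the desired output concrete, here is a minimal sketch of the space-separated result. `textWithSpaces` is a hypothetical helper, not part of xss, and the regular expression is only an illustration: it is not a sanitizer, it would also split text around inline tags (`<b>bo</b>ld` becomes "bo ld"), and the result would still need escaping before being inserted into a page.

```javascript
// Hypothetical helper (not part of xss): reduce markup to its text,
// separating the text of adjacent elements with a single space.
function textWithSpaces(html) {
  return html
    .replace(/<[^>]*>/g, " ") // replace every tag with a space
    .replace(/\s+/g, " ")     // collapse runs of whitespace into one space
    .trim();                  // drop leading and trailing whitespace
}

var htmlContent = "<table><tbody><tr><td>Cell1</td><td>Cell2</td></tr></tbody></table>";
console.log(textWithSpaces(htmlContent));
// => "Cell1 Cell2"
```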

My feature request is an additional supported value for the stripIgnoreTagBody parameter (this should not introduce any breaking change):

false|null|undefined // by default: do nothing
'*'|true             // filter out all tags not in the whitelist
['tag1', 'tag2']     // filter out only specified tags not in the whitelist
+ { // object
+     keepTextContent: true | false | string[]
+     // true: remove all child tags, preserve text content
+     // false: remove content including text
+     // array: remove all child tags and preserve text content, except for tags listed in the array, whose content is removed entirely
+ }
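One way to read the proposed values is as a three-way decision per non-whitelisted tag. The sketch below is only my interpretation: `resolveBodyAction` and the action names "keep", "remove" and "text" are hypothetical and do not exist in xss.

```javascript
// Hypothetical sketch (not xss code): decide what happens to the body of a
// non-whitelisted tag for a given stripIgnoreTagBody value.
//   "keep"   -> leave the body as it is today
//   "remove" -> drop the body, including its text
//   "text"   -> drop child tags but keep the text content
function resolveBodyAction(option, tagName) {
  if (option === false || option === null || option === undefined) {
    return "keep"; // default: do nothing
  }
  if (option === true || option === "*") {
    return "remove"; // existing behaviour for all ignored tags
  }
  if (Array.isArray(option)) {
    return option.includes(tagName) ? "remove" : "keep"; // existing behaviour
  }
  if (typeof option === "object") {
    var keep = option.keepTextContent; // proposed new object form
    if (keep === true) return "text";
    if (Array.isArray(keep)) return keep.includes(tagName) ? "remove" : "text";
    return "remove"; // keepTextContent: false (or missing)
  }
  return "keep"; // unknown value: fall back to the default
}

console.log(resolveBodyAction({ keepTextContent: ["script"] }, "table"));  // => "text"
console.log(resolveBodyAction({ keepTextContent: ["script"] }, "script")); // => "remove"
```

Because the first three branches mirror the values that are accepted today, existing configurations would behave as before.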

I think it should be fairly easy to implement removal of all html tags between start and end tags. There's already an example of how to remove all HTML tags using xss in the documentation.

From the xss documentation to remove all HTML tags:

var source = "<strong>hello</strong><script>alert(/xss/);</script>end";
var html = xss(source, {
  whiteList: [], // empty, means filter out all tags
  stripIgnoreTag: true, // filter out all HTML not in the whitelist
  stripIgnoreTagBody: ["script"] // the script tag is a special case, we need
  // to filter out its content
});