shivanshu3 / strip-js

[DEPRECATED] NPM Module which strips out all JavaScript code from some HTML text
MIT License

Remove 'src' attributes from 'script' tags only. Removing script tag … #8

Open Nashorn opened 4 years ago

Nashorn commented 4 years ago

Remove 'src' attributes from 'script' tags only. Removing script tags alters the DOM tree and breaks CSS nth-child operations, because the result no longer matches the original HTML document.

For example, in screen scraping the captured HTML document needs to match the DOM structure of the live site, so that CSS rules using nth-level selectors (e.g. main > div:nth-child(2) > div > div > div:nth-child(3) > ...) or JavaScript's querySelector work on both the live DOM and the scraped copy.

Removing all script tags changes the tree, and the example selector above then breaks, returning null.
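To make the index shift concrete, here is a minimal sketch that models a DOM subtree as plain objects and emulates a selector such as main > div:nth-child(3). The helpers el, nthChild and removeScripts, and the ids used, are hypothetical names made up for this illustration; they are not part of strip-js or of any DOM library.

```javascript
// Hypothetical stand-in for a DOM element: tag name, an id for display, ordered children.
const el = (tag, id, children = []) => ({ tag, id, children });

// Emulates `parent > tag:nth-child(n)`: the n-th child (1-based, counting ALL
// element siblings, scripts included) must have the given tag, otherwise no match.
function nthChild(parent, tag, n) {
  const node = parent.children[n - 1];
  return node !== undefined && node.tag === tag ? node : null;
}

// Returns a copy of the tree with every script element removed.
function removeScripts(node) {
  const kept = node.children.filter((c) => c.tag !== 'script').map(removeScripts);
  return el(node.tag, node.id, kept);
}

// "Live" structure: a script sits between two divs.
const live = el('main', 'root', [
  el('div', 'header'),
  el('script', 'inline-js'),
  el('div', 'content'),
]);

const stripped = removeScripts(live);

// On the live tree the third child is the 'content' div.
const onLive = nthChild(live, 'div', 3);
// After removal there are only two children, so the same selector finds nothing.
const onStripped = nthChild(stripped, 'div', 3);

console.log(onLive && onLive.id, onStripped);
```

In this model the selector matches the 'content' div on the live tree and returns null on the stripped copy, which is the mismatch described above.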

FIX: Second, this addresses a bug where the 'action' attribute was not being removed because of an improper 'domElement' declaration. The variable is now declared with var and references the correct element in the loop.

Nashorn commented 4 years ago

The solution in the PR performs a slightly different stripping but maintains the DOM tree:

  1. $('script').remove(); is replaced with $('script').html(""); to blank out any JavaScript code while keeping the tag
  2. All scripts with a src attribute, $('script[src]'), have their src attribute removed
  3. A fix to remove the 'action' attribute from forms

Preserving the script tags is important for maintaining the DOM tree.
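The three steps above can be sketched as a single function. This is only an illustration of the approach, not the code in the PR: the function name stripPreservingTree is hypothetical, it assumes $ is a jQuery-style or cheerio-style selector function, and the one-line form removal in step 3 is an assumption (the actual fix is described as happening inside a loop over elements).

```javascript
// Hypothetical helper: strips JavaScript while keeping every <script> element
// in place, so sibling positions (and nth-child selectors) stay the same.
// `$` is assumed to be a jQuery/cheerio-style selector function.
function stripPreservingTree($) {
  // 1. Blank out inline code instead of removing the element.
  $('script').html('');

  // 2. Drop the src attribute so external scripts are no longer referenced.
  $('script[src]').removeAttr('src');

  // 3. Remove the action attribute from forms (assumed one-line equivalent
  //    of the loop-based fix mentioned in the issue).
  $('form').removeAttr('action');
}
```

With cheerio, for instance, this could be called after const $ = cheerio.load(html); and the result serialized with $.html(). The emptied script elements still count as children, so positional selectors resolve to the same nodes as on the live page.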

shivanshu3 commented 4 years ago

Hi @Nashorn, thanks for making this change! I'll take a look at it in a couple days.