jsdom / jsdom

A JavaScript implementation of various web standards, for use with Node.js
MIT License

How to disable the processing of CSS? #2005

Open cawa-93 opened 7 years ago

cawa-93 commented 7 years ago

Basic info:

Minimal reproduction case

If I try to parse HTML that contains

<style type="text/css">
@import url("/css/blok_add.css");
</style>

I get the following error:

Error: Could not parse CSS @import URL /css/blok_add.css relative to base URL "about:blank"
    at scanForImportRules (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\helpers\stylesheets.js:68:25)
    at exports.evaluateStylesheet (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\helpers\stylesheets.js:33:3)
    at HTMLStyleElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\HTMLStyleElement-impl.js:23:5)
    at HTMLTableDataCellElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableDataCellElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableRowElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableRowElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableSectionElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableSectionElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLDivElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLDivElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableDataCellElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableDataCellElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableRowElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableRowElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableSectionElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableSectionElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLTableElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)
    at HTMLTableElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Element-impl.js:93:11)
    at HTMLDivElementImpl._attach (D:\Develop\parser\node_modules\jsdom\lib\jsdom\living\nodes\Node-impl.js:281:15)

The problem can be solved by specifying the parameter url.

new JSDOM(html, {url: 'http://example.com'})
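
A small illustration of why the base URL matters, using Node's built-in URL class rather than jsdom's internals (the assumption is that both follow the same WHATWG URL rules for this case; `tryResolve` is a hypothetical helper name):

```javascript
// The @import path from the error message above.
const importPath = "/css/blok_add.css";

// Hypothetical helper: resolve a path against a base, or return null on failure.
function tryResolve(path, base) {
  try {
    return new URL(path, base).href;
  } catch {
    return null; // the path cannot be resolved against this base
  }
}

// "about:blank" cannot act as a base for a relative path, so resolution fails.
const againstBlank = tryResolve(importPath, "about:blank");

// A real http(s) base gives the relative path something to resolve against.
const againstSite = tryResolve(importPath, "http://example.com");

console.log({ againstBlank, againstSite });
```

With no `url` option the document's URL is "about:blank", which matches the base URL named in the error.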

However, I'm wondering if it's possible to completely disable the processing of CSS?

domenic commented 7 years ago

It's not possible at the moment. You can suppress errors using your own VirtualConsole though.

The correct thing to do is indeed to set a base URL.
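
A minimal sketch of the VirtualConsole approach, for readers who want to see it spelled out. `VirtualConsole`, the `virtualConsole` option and the "jsdomError" event are documented jsdom API; the `parseQuietly` helper and the idea of passing the jsdom module in as an argument are assumptions made for this example:

```javascript
// Hypothetical helper: parse HTML while collecting jsdom errors instead of
// letting them reach the real console. The jsdom module is passed in, so the
// same helper could be used with a fork.
function parseQuietly({ JSDOM, VirtualConsole }, html, options = {}) {
  const errors = [];

  // A VirtualConsole that is not forwarded to the real console prints nothing.
  const virtualConsole = new VirtualConsole();

  // CSS problems such as "Could not parse CSS ..." are reported on this event.
  virtualConsole.on("jsdomError", (error) => errors.push(error));

  const dom = new JSDOM(html, { ...options, virtualConsole });
  return { dom, errors };
}

// Usage (requires jsdom to be installed):
//   const { dom, errors } = parseQuietly(require("jsdom"), html);
```

This only hides the errors. The stylesheet is still processed, so it does not address the performance side of the question.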

pawel-dubiel commented 7 years ago

The ability to disable certain jsdom features could help optimize performance. And when scraping content, CSS processing is certainly not needed.

cawa-93 commented 7 years ago

@pawel-dubiel, Right. In my case, I'm only interested in the HTML and manipulating it. I do not need CSS or JavaScript processing. I would like to disable all of this to improve speed.

pawel-dubiel commented 7 years ago

@cawa-93 If you don't need JS/CSS and you don't need the DOM API, you may want to consider cheerio (which should be a few times faster, or at least it was a few years ago).

domenic commented 7 years ago

Do you have any evidence of a performance increase? With my knowledge of jsdom's architecture, there shouldn't be much, if any, since it's all done lazily.

pawel-dubiel commented 7 years ago

@domenic 8x is a claim from https://github.com/cheeriojs/cheerio

"Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM."

And this old screencast https://vimeo.com/31950192 which at the end compares both jsdom and cheerio performance, but it's 6 years old.

It would be really interesting to see results from 2017.

But both projects have a different scope.

domenic commented 7 years ago

Yes, those Cheerio results are really dishonest, as we've mentioned to their author before. And they certainly have nothing to do with CSS.

mikegleasonjr commented 6 years ago

I'm in the camp where cssom fails to parse my CSS but I still need to parse the js.

dontcallmedom commented 6 years ago

I too would like to disable inline CSS parsing - it is mostly a waste of CPU and memory in my various usages of jsdom.

domenic commented 6 years ago

https://github.com/tmpvar/jsdom/issues/2005#issuecomment-333922999

mikegleasonjr commented 6 years ago

For me, it is having to write more code to gracefully handle CSS parsing "errors" that bothers me, even though my CSS is valid (it is just not supported by jsdom's parser).

EDIT: maybe I should try again with the newer versions

dontcallmedom commented 6 years ago

Here is a script which, when run along with a downloaded copy of https://www.w3.org/TR/html52/single-page.html takes around 6 seconds less on my machine when evaluateStylesheet is commented out in HTMLStyleElement-impl.js:

const {JSDOM} = require("jsdom");
console.time("DOM parsing");
JSDOM.fromFile("single-page.html").then(dom => {
  console.timeEnd("DOM parsing");
});

domenic commented 6 years ago

That's great evidence; thanks!

KonradLinkowski commented 5 years ago

Has anything changed since January 22nd?

domenic commented 5 years ago

https://mobile.twitter.com/slicknet/status/782274190451671040

minas90 commented 5 years ago

We have the same issue. We parse hundreds of pages per minute. When I manually removed the CSS from most of them, parsing time decreased drastically. We also get a bunch of "Error: Could not parse CSS" errors. Please make CSS parsing optional. What's the best way to fork jsdom and disable CSS parsing?

sdgandhi commented 4 years ago

Fork with stylesheet parsing disabled: https://github.com/dfblue/jsdom

jikkujose commented 4 years ago

Quite surprised this wasn't thought of when implementing it! Processing just raw HTML seems to be a very common case when CSS is just a burden to consider.

EtzBetz commented 3 years ago

I would like to see an option to disable CSS processing as well..

jeffRTC commented 3 years ago

Just hit the wall.

@domenic I'm not a communist, but let me ask you one thing. How much do you need to implement this feature?

I can write off a Cheque directly to you.

ahmadsdn commented 3 years ago

Is disabling CSS parsing really that complicated? Processing CSS is really not necessary, and is actually wasteful, for a lot of use cases. It's been more than 3 years since this issue was raised! I wish it were resolved by now ☹

eedahl commented 2 years ago

Same boat, I just need the HTML and something CSS-related is failing.

kfitzgerald commented 2 years ago

Same – And in my case, jsdom is used as a dependency from other libs, so the option to switch to another dom provider just to disable css parsing is not terribly viable.

phgn0 commented 2 years ago

For anyone else stumbling here, I created yet another fork with the no-CSS patches on top of the latest JSDOM version: https://github.com/phgn0/jsdom-no-css.

Just use it as "jsdom": "phgn0/jsdom-no-css#master", in your package.json (or fork it).

weijarz commented 2 years ago

I currently use this to get around it:

new JSDOM(html.replace(/<style(\s|>).*?<\/style>/gi, ''))

uuf6429 commented 2 years ago

Since there are already at least two forks with a fix, I assume it's not particularly difficult to implement, so what's stopping this from progressing, @domenic?

Disclaimer: I've also just encountered this problem and in my use-case, I do not need CSS parsing.

rudolfbyker commented 1 year ago

Here's a workaround that uses monkey-patching. I think this is easier than keeping forks around:

/**
 * Workaround for https://github.com/jsdom/jsdom/issues/2005
 */
export function disableCssProcessing() {
  const HTMLStyleElementImpl =
    require("jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl").implementation;
  HTMLStyleElementImpl.prototype._updateAStyleBlock = () => {};
}

khmyznikov commented 1 year ago

ES module import version of the previous patch:

import { implementation } from 'jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js';
implementation.prototype._updateAStyleBlock = () => {};

attacomsian commented 12 months ago

I currently use this to get around it:

new JSDOM(html.replace(/<style(\s|>).*?<\/style>/gi, ''))

It didn't work for me. Instead, I found another regex that works for both inline styles and scripts.
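
One likely reason the first pattern misses some pages, offered as a guess rather than a confirmed diagnosis: without the `s` flag, `.` does not match line terminators, so a `<style>` block that spans several lines is left in place. A small sketch with a made-up sample input:

```javascript
// Hypothetical sample input: a <style> block that spans several lines.
const html = [
  "<p>before</p>",
  '<style type="text/css">',
  '@import url("/css/blok_add.css");',
  "</style>",
  "<p>after</p>",
].join("\n");

// Without the `s` flag, `.` stops at newlines, so this never reaches </style>.
const singleLine = /<style(\s|>).*?<\/style>/gi;

// With the `s` (dotAll) flag, `.` also matches newlines.
const dotAll = /<style(\s|>).*?<\/style>/gis;

const untouched = html.replace(singleLine, "");
const stripped = html.replace(dotAll, "");

console.log(untouched === html); // the multi-line block survived
console.log(stripped.includes("<style")); // the block is gone
```

Either way, a regex is not an HTML parser, so unusual markup can still slip through.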

I now use the following code to remove inline CSS and scripts:

const sanitizeHtml = html => {
  return html?.replace(/<style([\S\s]*?)>([\S\s]*?)<\/style>/gim, '')?.replace(/<script([\S\s]*?)>([\S\s]*?)<\/script>/gim, '')
}

let doc = new JSDOM(sanitizeHtml(html), { url })

eric-hemasystems commented 10 months ago

Another reason to disable CSS processing is for cases where JSDOM is unable to parse the CSS. For example, it doesn't seem to be able to handle CSS nesting.

Even if that support is added, there will always be future changes to CSS that JSDOM will be catching up to. Providing the ability to disable that parsing when it isn't needed allows that issue to be avoided.

thescientist13 commented 9 hours ago

Just ran into the same issue as above regarding CSS nesting support, but was able to apply the HTMLStyleElement patch override.