machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

CSS not captured in some cases #68

Closed machawk1 closed 9 years ago

machawk1 commented 9 years ago

Test site: http://bl.ocks.org/mbostock/1353700 uses:

<style>
@import url('./cssFile.css');
</style>
machawk1 commented 9 years ago

On https://github.com/machawk1/warcreate/blob/master/js/content.js#L183 the list of stylesheets is pulled from document.styleSheets, which has 1 item containing 3 rules: the @import statement and two rules defined inline. If a rule is an @import, we need to fetch the CSS file it references.

machawk1 commented 9 years ago
<html>
<head>
<link rel="stylesheet" href="dummyURI.css" type="text/css" />
<style>
@import url('./cssFile.css');
body {color: red;}
</style>
</head>
<body>
Text
</body>
</html>

would yield a document.styleSheets length of two: one entry for the linked stylesheet, one for the inline style tag. The file pulled in by @import does not get an entry of its own.

machawk1 commented 9 years ago

The "type" of each rule can be observed when iterating through the rules defined in a stylesheet. See https://developer.mozilla.org/en-US/docs/Web/API/CSSRule . @import is type 3. Standard CSS style rules are type 1.
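
To make the type check concrete, here is a minimal sketch of how the @import rules could be picked out. `collectImportHrefs` is a hypothetical helper, not code from content.js, and it assumes each sheet either exposes its rule list or returns null for it, as seen elsewhere in this thread.

```javascript
// CSSRule.type value for @import rules, as documented on MDN.
const IMPORT_RULE = 3;

// Hypothetical helper (not from content.js): returns the href of every
// @import rule found in a document.styleSheets-like collection.
function collectImportHrefs(styleSheets) {
  const hrefs = [];
  for (const sheet of Array.from(styleSheets)) {
    // Fall back to an empty list when the sheet exposes no rules.
    const rules = sheet.cssRules || sheet.rules || [];
    for (const rule of Array.from(rules)) {
      if (rule.type === IMPORT_RULE && rule.href) {
        hrefs.push(rule.href);
      }
    }
  }
  return hrefs;
}
```

In a content script this would be called as `collectImportHrefs(document.styleSheets)`, and each returned href would still need to be resolved against the page URL and fetched.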

machawk1 commented 9 years ago

This might mean we're also not getting CHARSET_RULE types, FONT_FACE_RULE types, etc.

machawk1 commented 9 years ago

The current implementation gets us the warc-request, the request headers, the warc-response, and the response headers, but no response body.

machawk1 commented 9 years ago

Bug on console:

Error in event handler for (unknown): TypeError: Cannot read property 'length' of null at chrome-extension://oahhpkadldedkbfoooakcbicohedmljh/js/content.js:185:60

Line 185 refers to:

for(var rules=0; rules<document.styleSheets[ss].rules.length; rules++){

Reproducible at nasa.gov
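
One possible guard, sketched under the assumption that a sheet with no readable rules should simply be skipped in this loop. `readableRules` and `countRules` are hypothetical names, not functions from content.js.

```javascript
// Hypothetical helper: return a sheet's rules as a plain array, or an
// empty array when the browser does not expose them.
function readableRules(sheet) {
  let rules = null;
  try {
    rules = sheet.cssRules || sheet.rules || null;
  } catch (err) {
    // Some browsers throw instead of returning null for sheets they
    // will not expose (for example, cross-origin ones).
    rules = null;
  }
  return rules ? Array.from(rules) : [];
}

// Same loop shape as line 185, but safe when rules is null.
function countRules(styleSheets) {
  let total = 0;
  for (let ss = 0; ss < styleSheets.length; ss++) {
    total += readableRules(styleSheets[ss]).length;
  }
  return total;
}
```

A sheet that comes back empty here but has an href would still need its content fetched separately, otherwise the CSS is silently dropped from the WARC.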

machawk1 commented 9 years ago

In NASA's case, the CSS is loaded via a remote_loader.js and thus has no rules, likely because the browser parses it differently. Despite this, the browser still recognizes the included file as a CSSStyleSheet (isn't that a bit redundant?). Compare the two objects below: the first comes from an inline style tag, the second from the remotely loaded link.

CSSStyleSheet {
  cssRules: CSSRuleList
  disabled: false
  href: null
  media: MediaList
  ownerNode: style
  ownerRule: null
  parentStyleSheet: null
  rules: CSSRuleList
  title: null
  type: "text/css"
  __proto__: CSSStyleSheet
}

CSSStyleSheet {
  cssRules: null
  disabled: false
  href: "http://search.usa.gov/assets/sayt.css"
  media: MediaList
  ownerNode: link
  ownerRule: null
  parentStyleSheet: null
  rules: null
  title: null
  type: "text/css"
  __proto__: CSSStyleSheet
}
machawk1 commented 9 years ago

Reopening. The original CSS file from http://bl.ocks.org/mbostock/1353700 has HTTP response headers but no payload.

machawk1 commented 9 years ago

The absolute() function in content.js is storing the data at index "http:/style.css?20120730", which is missing the host. Something is awry in this function.
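
For comparison, a minimal sketch of resolving a relative stylesheet reference with the standard URL constructor instead of hand-rolled string handling. `resolveAgainst` is a hypothetical name, and the relative reference used below is assumed for illustration; this does not reproduce what absolute() does internally.

```javascript
// Hypothetical alternative to a hand-rolled resolver: let the standard
// URL constructor resolve a relative reference against the page URL.
function resolveAgainst(relative, base) {
  return new URL(relative, base).href;
}

// The relative reference here is an assumption, not taken from the page.
console.log(
  resolveAgainst("../style.css?20120730", "http://bl.ocks.org/mbostock/1353700")
);
// prints "http://bl.ocks.org/style.css?20120730"
```

The host and query string both survive, so the resulting key would match the URL the browser actually requested.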