Browser executable-less preprocessor

saschanaz commented 6 years ago

ReSpec shouldn't require installing real browsers to use respec2html.js as jsdom supports browser-like DOM manipulation in Node.js. Dropping Nightmare (and so Electron) dependency will allow easier build environment settings on GPU-less CLI environments.

marcoscaceres commented 6 years ago

Definitely. I've hit this on Dreamhost too.

The problem in the past (years ago!) was that JSDOM was super-duper slow for whatever reason - specially on big specs - so it wasn't feasible to use it. That's probably been addressed since then, but someone would need to try it out.

tidoust commented 6 years ago

I thought I'd report here because I've been investigating how to make Respec run in JSDOM (for Reffy). Good news is it's mostly doable. Bad news is it's not exactly straightforward...

The tweaks that I needed to make to JSDOM and Respec in Reffy are parked in a drop-respec branch in Reffy for now. See code and comments in util.js. Here is a summary of problems I encountered and workarounds I found so far:

JSDOM does not support URL.createObjectURL: Respec uses that to create a separate worker for the core/highlight module. Dropping that module solves the problem, but obviously means that the module does not run.
JSDOM does not support document.createRange: Respec uses that in the core/list-sorter module. Dropping that module solves the problem, with the obvious same consequence as above. Also note #1665 related to that module: JSDOM throws an error because the generated CSS selector is invalid.
JSDOM's CSS parser is not necessarily up-to-date: JSDOM does not like the @keyframes and @supports directives in particular and throws an error when parsing the CSS Respec injects. Not sure this really has any real impact, but since I don't need the CSS in Reffy, replacing them with @media directives does the trick.
JSDOM does not support innerText: Given the use in Respec, replacing occurrences of innerText in Respec by textContent solves the problem.
marked behaves strangely in the presence of <pre> blocks when there are no blank lines after the block. For instance, if you look at this markdown demo, you'll note that the generated HTML for the <pre> tag includes unexpected (and broken) <p> tags. Now, the problem is that Respec drops blank lines before it calls that library. An example of spec where this happens is Netinfo. Interestingly, in "real" browsers, the weird tags magically disappear afterwards and the generated HTML looks clean (but the output of markdownToHtml is wrong). I do not know why. In JSDOM, the generated HTML stays ugly, and extracting the IDL afterwards is problematic. Replacing \n with \n\n in normalizePadding to preserve a blank line fixes the issue.
JSDOM does not support cloning of attributes: This is needed by HyperHTML. In theory, providing a polyfill is easy but constructors and prototypes are shared in JSDOM, so there is no good way to get back to the underlying document when given an attribute to clone that does not belong to one, or at least I could not find a good solution here. This requires tweaking the code of HyperHTML to call createAttributeNS itself when the attribute it wants to clone does not belong to a document.
JSDOM does not support inserAdjacentElement: It's easy to come up with a polyfill.
JSDOM does not support closest: Same thing, it's easy to come up with a polyfill.
JSDOM does not support matchMedia: Given that Reffy does not need that logic, it's easy to mock the interface
JSDOM does not support fetch: Actually, that's fine for Reffy, because I want to control network requests as much as possible and force usage of a local cache when possible.
JSDOM does not support a few visual methods on window: In particular, blur, focus, moveBy, moveTo, resizeBy, resizeTo, scroll, scrollBy, scrollTo are not supported (and a couple of specs call scrollBy). Reffy does not need that logic, so it's easy to mock these methods.
JSDOM does not support IndexedDB: Respec degrades gracefully when IndexexDB is not supported. It still outputs an obscure error message to say that it cannot call open on undefined (where undefined should have been IndexedDB). Detecting whether IndexedDB exists before trying to access it in Respec would allow to report a more useful error message. Also worth noting here: the readyPromise gets rejected but, depending on the spec, the promise may be checked in another tick. When that happens, Node.js reports an UnhandledPromiseRejectionWarning, which is, as far I can tell, a false positive.
JSDOM does not seem to support the whatToShow filter in TreeWalker: As a result, HyperHTML fails to remove the  comments it adds all over the place when it runs. That does not have any consequence.
As far as I can tell, there is no easy way to run code in JSDOM once Respec has started to run, and in particular when the respecIsReady promise is set. As a result, I need to do a bit of polling to wait for the promise to appear so that I can attach to it and detect the end of the generation. Or is there a better way?

With the workarounds and limitations mentioned above, I manage to run Reffy on all W3C specs that Reffy knows about, and the outcome seems good!

saschanaz commented 6 years ago

Thanks for sharing your experience!

Yeah, I also tried JSDOM and faced those limits. Maybe we should change the whole logic. Some personal thoughts:

We now have access to ES modules, ideally we should be able to use the same modules in both browsers and Node.js without JSDOM page load or a separate browser executable.
We currently heavily depends on browser native global DOM methods, I think we can still use JSDOM there if we make sure that we pass JSDOM document object rather than using global variables. Or viperHTML?
Workers. I hope we can load it as a module in Node.js, if worker scripts can be modules... (Update: Yes, a worker script can be a module.)

marcoscaceres commented 6 years ago

@tidoust, I thought Reffy ran against "static" documents? If we had a Reffy web service, ReSpec could simply post a static snapshot directly from the browser without needing to run any transformations on the server.

To rephrase: what is Reffy's input?

tidoust commented 6 years ago

To rephrase: what is Reffy's input?

Reffy's goal is to generate a knowledge graph of specs and an analysis of that graph to create a list of potential anomalies. To generate the knowledge graph, Reffy:

starts from a list of spec URLs (multiple lists, actually: specs-common.json, specs-cg.json, specs-whawg.json.
retrieves information about each spec from the W3C API and Specref, including the URL of the Editor's Draft
fetches the Editor's Draft of the spec
parses the Editor's Draft to extract IDL, references and links. For Editor's Drafts that use ReSpec, this is where Reffy needs to get the generated document, which is currently done using RespecDocWriter.

The input of the Reffy web service that we talked about is slightly different. It would take as input:

the URL of a spec, or a static snapshot of the generated draft
the latest version of the knowledge graph (typically from reffy-reports).

From that, it would simply have to parse the provided spec, and analyses the result against the knowledge graph. No transformation is needed on the server for that (unless the input is the raw source, obviously).

I started to develop such a service (see reffy-service). I haven't had time to test it with POST content, so it's most probably broken for now. It also lacks a README for now. Note the service needs the HTML and the URL of the spec to test. The URL is needed when the spec already exists in the knowledge graph, so that Reffy can remove it from there first before, otherwise the resulting graph would contain two specs that e.g. define the same IDL terms, and analysis would report that as anomalies.

I'm now looking at ways to make the service more robust before we deploy it on one of our servers. The investigation reported here is part of that, as an attempt to reduce the number of dependencies that would have to run on the server.

marcoscaceres commented 6 years ago

@tidoust thanks for clarifying. @saschanaz, do you know if puppeteer runs without needing a virtual screen? If yes, then that addresses GPU-less CLI environments problem in one sense... still a fat dependency, but at least we don't need to refactor code significantly to back port ReSpec to JSDOM.

saschanaz commented 6 years ago

do you know if puppeteer runs without needing a virtual screen?

puppeteer runs in headless mode by default, so yes.

marcoscaceres commented 6 years ago

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

tidoust commented 6 years ago

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

Sure! I was not reporting these issues as requests to change ReSpec. The problem here is not how ReSpec is written, it is JSDOM not yet supporting enough features for ReSpec. I just wanted to document the current limitations somewhere.

Actually, we still have a virtual screen for now on the server, but maybe that's no longer needed with Puppeteer (it was for nightmare). We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

marcoscaceres commented 6 years ago

I just wanted to document the current limitations somewhere.

Ok cool. Sorry I was confused. I appreciate the documentation above also, so thanks for that!

We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

Understood. Ok, if there is anything else we can do, happy to try to help. Reffy looks super cool and if we can integrate somehow directly (via POST or whatever), happy to help with that however we can!

saschanaz commented 5 years ago

@marcoscaceres This has always been my ultimate purpose, for binary-less lighter dependencies, a scriptable API, easier test without karma, and easier debug. Could we bake an experimental build with minimal features to check what should be done, although I'm currently not sure which feature to start?

Edit: My vague thought:

We create a structure that represents ReSpec document, which includes Document object, user configuration, definition maps, biblio maps, etc.
Expose an API function that receives Document and user configuration, creates the structure, and passes it to modules. For the current in-HTML script injection use, ReSpec calls it internally with the current document and respecConfig value.
Modules will receive the structure, replacing the current use of global document and the conf argument. Normalizing configuration shouldn't modify the input configuration to prevent unexpected side effect.
Modules that store its state in module level should instead store it in the structure.
Modules that add event listeners, or do anything cannot be stored statically, should instead insert an inline script to do it. (#1668)

marcoscaceres commented 5 years ago

Excited to see where this ends up!

saschanaz commented 5 years ago

I got a blocker from hyperHTML, which cannot be imported on node.js. Maybe we have to fork viperhtml (which is a lot lighter than hyperhtml but only for server) to make it work.

marcoscaceres commented 5 years ago

Pinged WebReflection for guidance.

WebReflection commented 5 years ago

I got a blocker from hyperHTML, which cannot be imported on node.js

hyperHTML is 100% code covered via node, all you need to do is to write require('basichtml').init() before requiring hyperhtml.

I've used basichtml to serve output even in cloudflare so everything you need should be there.

However, viperHTML is always an alternative if you need just plain SSR 👋

saschanaz commented 5 years ago

So it looks like require('basichtml').init() defines document globally? That somehow clashes with jsdom... But thanks, TIL basichtml.

WebReflection commented 5 years ago

@saschanaz if you have jsdom, and it pollutes the global scope, hyperHTML should work already.

If you don't you can use an empty .init() or specify various things, including the object to pollute.

i.e.

const window = {};
require('basichtml').init({window});

const {customElements, document, HTMLElement} = window;
customElements.define('hello-world', class extends HTMLElement {
  connectedCallback() {
    this.textContent = 'Hello world!';
  }
});

// workaround to execute with a temporary global document
const $ = fn => {
  const oldDocument = global.document;
  global.document = document;
  const result = fn();
  if (oldDocument)
    global.document = oldDocument;
  else
    delete global.document;
  return result;
};

const {bind} = $(() => require('hyperhtml'));

$(() => {
  bind(document.documentElement)`
  <head>
    <title>This is basicHTML</title>
  </head>
  <body><hello-world/></body>`;
});

console.log(document.toString());
/*
<!DOCTYPE html><html><head>
    <title>This is basicHTML</title>
  </head><body><hello-world>Hello world!</hello-world></body></html>
*/

but then again, if you have already a document everything should just work, you just need to require hyperhtml after 👋

saschanaz commented 5 years ago

I mean require("basichtml").init(); require("jsdom"); fails. Maybe jsdom detects browser environment by checking global existence of document or something...

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states. It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

WebReflection commented 5 years ago

I mean require("basichtml").init(); require("jsdom"); fails.

that is not a real-world use case, but even if you need to do that, you can pass an object instead, so nothing gets polluted, and jsdom will work.

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states

You are contradicting yourself, since jsdom pollute the global scope, right?

It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

Sure, whatever works best. I don't even know what is your actual goal (had no time to read this whole thread, I just jumped in at "I got stuck due node" when all I do is using node for everything I develop on the web 😁)

saschanaz commented 5 years ago

since jsdom pollute the global scope

AFAIK it doesn't, does it?

WebReflection commented 5 years ago

actually my bad, it doesn't. I've seen tests written with global.window = new JSDOM but apparently it doesn't indeed pollute anything by its own.

Then again, neither does basichtml, it's just the handy default that never, personally, hurts.

In regards of hyperHTML it needs to do some feature detection requiring a document in a way or another.

saschanaz commented 5 years ago

@tidoust Hi! Unfortunately we found that #2187 introduces too many changes and creates additional maintenance burden to us. (Even when it might give some hypothetical wins 👀)

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

tidoust commented 5 years ago

Hello @saschanaz,

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

I don't think that there is anything blocking us from using respec2html in practice. It was more a matter of reducing dependencies to large binaries (also not to have to run a virtual screen on the server, but that's no longer needed with a headless Chromium).

The one thing that I do now, and which I don't know if I can easily do if we switch to respec2html, is to hook my own implementation of "fetch" to force the use of cached versions of resources so that I can run crawls offline (for testing purpose). That is not a requirement though, and I'm sure I'll find a way to do that.

I totally understand the maintenance burden argument. I actually considered using a forked version of Respec with #2187 baked in... and decided against it for the very same reason.

+@dontcallmedom who was the one advocating for the switch to JSDOM in the first place.

dontcallmedom commented 5 years ago

I agree it'd a nice to have; if it's too costly to maintain, then we'll absorb the cost in running with headless chromium.

saschanaz commented 5 years ago

Okay, that means this is now officially low-priority unless we find an easier way.

@marcoscaceres I still want to keep #2187 open to 1) extract some good thing from there 2) maybe just for me to experiment things. Sounds good?

marcoscaceres commented 5 years ago

Sounds good. Yes, let’s grab whatever goodies we can from #2187.

saschanaz commented 4 years ago

A new possibility: someone made a JS-to-binary compiler that also supports WASM.

Not sure it's any valid, though.

speced / respec

Browser executable-less preprocessor #1469