svnlabs / google-caja

Automatically exported from code.google.com/p/google-caja

Try using native HTML parsing instead of html-sanitizer.js #1823

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
For a faster and smaller Caja, it would be nice if we could skip having an 
HTML parser. There are allegedly tools to create and/or parse an “inert” 
HTML document, one which would not execute script or render, which we could 
then use as the data source for HtmlEmitter. Take a look at:

    document.implementation.createHTMLDocument
    window.DOMParser

Original issue reported on code.google.com by kpreid.switchb.org on 23 Jul 2013 at 9:06

GoogleCodeExporter commented 9 years ago
The following works in Firefox, but returns null in Chrome and Safari. 
According to MDN, it works on IE and not on Opera.

    new DOMParser().parseFromString('hello world', 'text/html')

The following works in Chrome and Safari, but returns a blank document in 
Firefox (all writes I've tried have no effect). Safari also has the quirk that 
the result of createHTMLDocument includes '<title>undefined</title>'.

    var doc = document.implementation.createHTMLDocument();
    doc.write('hello world');
    doc.close();

So, between these two we might be able to cover all browsers we support.
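
A minimal sketch of how the two approaches could be combined, trying DOMParser first and falling back to createHTMLDocument. The name `parseInertHtml` and the `env` parameter are hypothetical (they are not part of Caja); `env` only stands in for the browser globals so the fallback order can be exercised with stubs. Passing an empty title to createHTMLDocument and calling doc.open() before writing are assumptions intended to sidestep the Safari title quirk and the `<body>`-context parsing issue, not something verified across browsers.

```javascript
// Sketch only: parseInertHtml is a hypothetical helper, not Caja code.
// `env` stands in for the browser globals; it defaults to `window` if present.
function parseInertHtml(html, env) {
  env = env || (typeof window !== 'undefined' ? window : {});

  // First choice: DOMParser with 'text/html'. Per the comment above this
  // works in Firefox but returns null in Chrome and Safari.
  if (typeof env.DOMParser === 'function') {
    try {
      var parsed = new env.DOMParser().parseFromString(html, 'text/html');
      if (parsed) { return parsed; }
    } catch (e) {
      // Assumption: some implementations may throw for an unsupported
      // MIME type instead of returning null; fall through in that case.
    }
  }

  // Fallback: create a detached document and write the markup into it.
  var impl = env.document && env.document.implementation;
  if (impl && typeof impl.createHTMLDocument === 'function') {
    var doc = impl.createHTMLDocument('');  // explicit empty title
    doc.open();   // start from a fresh document rather than inside <body>
    doc.write(html);
    doc.close();
    return doc;
  }

  // Neither mechanism is available.
  return null;
}
```

In a browser this would be called as `parseInertHtml('hello world')`; a caller should still be prepared for a `null` result, or for an empty document in browsers where the writes have no effect.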

Original comment by kpreid.switchb.org on 28 Aug 2013 at 8:01

GoogleCodeExporter commented 9 years ago
The above code also needs doc.open() before doc.write(...); otherwise the 
written content is parsed and inserted in the <body> context.

Did a quick prototype which removes html-sanitizer.js entirely and replaces 
HtmlEmitter's use of its parser with this technique, and all the tests I've 
tried pass (on Chrome) except that I haven't implemented .innerHTML (which is 
currently based on sanitizing the provided HTML).

Original comment by kpreid.switchb.org on 28 Aug 2013 at 9:25

GoogleCodeExporter commented 9 years ago
before you get too deep, we can't remove html-sanitizer because it's used 
outside caja, and has to support more browsers than ses does. keeping it as 
dead code until it's pulled out into a separate project is ok

Original comment by felix8a on 28 Aug 2013 at 9:31

GoogleCodeExporter commented 9 years ago
The idea here is to avoid using html-sanitizer inside Caja itself, with the 
goal of reduced code size (not including html-sanitizer in the Caja frame 
code), increased speed, and more parsing fidelity — not to discontinue the 
standalone sanitizer.

That said, if this technique worked everywhere, it could replace the parser 
inside html-sanitizer.js too.

Original comment by kpreid.switchb.org on 28 Aug 2013 at 9:38

GoogleCodeExporter commented 9 years ago
Note that since this technique parses the entire document before executing any 
script, it can't be made to handle unbalanced document.write structure, e.g.:

    <div>
      foo
      <script>document.write('</div><div>');</script>
      bar
    </div>

This is the same restriction as present in ES5/3 mode for the same reason, but 
it would be new to ES5 mode. 
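
To illustrate why the order of parsing and script execution matters here, the following toy model treats a document as a list of markup strings and scripts. It is a simplification for illustration only, not how Caja or any browser parser is implemented, and all names in it are made up.

```javascript
// Toy model, not Caja code: a document is a list of items, each either a
// markup string or { script: <text the script passes to document.write> }.
var source = ['<div>', 'foo', { script: '</div><div>' }, 'bar', '</div>'];

// Streaming model: each script runs as soon as it is reached, so its
// document.write output is parsed in place, before the markup after it.
function streamingMarkup(items) {
  return items.map(function (item) {
    return typeof item === 'string' ? item : item.script;
  }).join('');
}

// Parse-first model: all markup is parsed before any script runs, so the
// element structure is already fixed by the static markup alone.
function parseFirstMarkup(items) {
  return items.filter(function (item) {
    return typeof item === 'string';
  }).join('');
}

// Counts opening <div> tags as a rough stand-in for element structure.
function countDivs(markup) {
  return (markup.match(/<div>/g) || []).length;
}

console.log(countDivs(streamingMarkup(source)));   // 2: the write splits the div
console.log(countDivs(parseFirstMarkup(source)));  // 1: "foo" and "bar" share one div
```

In the parse-first case the single `<div>` already contains both text nodes by the time the script runs, so the written `</div><div>` has nothing left to split.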

Original comment by kpreid.switchb.org on 28 Aug 2013 at 11:10

GoogleCodeExporter commented 9 years ago
I'm not currently planning to do further work on this experiment. I'm attaching 
my current draft code (base = r5625) so it can be picked up in the future if 
there's interest.

Original comment by kpreid.switchb.org on 6 Nov 2013 at 6:44

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by kpreid@google.com on 7 Nov 2013 at 9:21