Caja does not strip downlevel-revealed conditional comments [possible fix included]

GoogleCodeExporter commented 9 years ago

What revision of the cajoler exhibits the problem?  On what browser and OS?

svn HEAD - all platforms

What steps will reproduce the problem?

var example = caja.makeSaxParser({ 
  pcdata : function(x) { if (/vml/.test(x)) throw new Error("not stripping downlevel-revealed conditional") },
});
example('<![if !vml]>foo<![endif]>');

What is the expected output? What do you see instead?

It throws the exception. It shouldn't if caja is to handle (drop) these, 
admittedly, annoying junk tags.

Please provide any additional information below.

http://msdn.microsoft.com/en-us/library/ms537512(v=vs.85).aspx#dlrevealed

these are pretty common in emails, which is the use case I'm looking at here. 

I've "fixed" this issue by changing:

        case '<\!':
          if (!/^\w/.test(next)) {
            if (h.pcdata) {

to: 

        case '<\!':
          if (!/^(\w|\[)/.test(next)) {
            if (h.pcdata) {

In parseCPS function, to catch it along with doctype declarations and the like. 
But of course I'm not sure that's right, having only a cursory understanding of 
the code.

I tried to run the tests, but seemingly did not have enough memory on this 
machine!

Original issue reported on code.google.com by morgan.a...@gmail.com on 19 Jun 2014 at 2:50

GoogleCodeExporter commented 9 years ago

I'm not sure what the issue here is. The stuff is _designed_ to be parsed as 
junk tags by non-IE parsers, and in the particular case of Caja they'll be 
discarded or rewritten by the whitelist. (Also, in general, when there's doubt 
about how Caja should interpret HTML input, we currently prefer to use the 
HTML5 / WHATWG rules.)

Is there a case where Caja fails to sanitize/sandbox/render content properly 
due to this?

Is there a specific use case for Caja's parser you have in mind that would 
benefit? You say you're parsing HTML email, but are you doing something other 
than using the Caja sandbox for the results?

Original comment by kpreid@google.com on 4 Aug 2014 at 5:41

Added labels: Component-Sanitizer
Removed labels: Component-Other

GoogleCodeExporter commented 9 years ago

> The stuff is _designed_ to be parsed as junk tags by non-IE parsers, and in 
the particular case of Caja they'll be discarded or rewritten by the whitelist. 

I'm not sure I was clear, the *tags themselves* are not being discarded. The 
tags are parsed as plain text. That is what I was attempting to illustrate with 
my example. e.g. currently:

var example = caja.makeSaxParser({pcdata : function(x) { console.log(x) } });
example('<![if !vml]>foo<![endif]>');

yields the following console output:

<!
[if !vml]
>
foo
<!
[endif]
>

I would expect just:

foo

> Is there a specific use case for Caja's parser you have in mind that would 
benefit? You say you're parsing HTML email, but are you doing something other 
than using the Caja sandbox for the results?

I am using the sax parser directly, and displaying it's output outside of the 
sandbox. I have html input (emails) with downlevel-revealed comments, which I 
wish to handle as above.

Original comment by morgan.a...@gmail.com on 4 Aug 2014 at 11:34

GoogleCodeExporter commented 9 years ago

Thank you for the clarification; that's definitely a bug.

Original comment by kpreid@google.com on 4 Aug 2014 at 11:59

Changed state: Accepted

svnlabs / google-caja

Caja does not strip downlevel-revealed conditional comments [possible fix included] #1922