tcowans / owasp-java-html-sanitizer

Automatically exported from code.google.com/p/owasp-java-html-sanitizer

Recognize URLs in <img srcset> #20

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/ describes an extension 
attribute to HTML <img> elements that allows multiple annotated URLs.

Make sure the URL protocol policy applies to all of them.

Original issue reported on code.google.com by mikesamuel@gmail.com on 21 Jan 2014 at 3:59

GoogleCodeExporter commented 9 years ago
> An image candidate string consists of the following components, in order, with the
> further restrictions described below this list:
> 1. Zero or more space characters.
> 2. A valid non-empty URL referencing a non-interactive, optionally animated,
>  image resource that is neither paged nor scripted.
> 3. Zero or more space characters.

I believe (3) is a spec error and should read "One or more". As written, the 
spec is ambiguous unless you assume that all matching is greedy left-to-right 
and that the additional restrictions cause backtracking, so that

    foo10w,bar10h

is equivalent to

    foo 10w , bar 10h

We need to test around commas at the end of and inside URLs:

    foo,,bar
    foo, bar
    foo,bar , baz

Comma is a sub-delim in RFC 3986, but it should be safe to percent-encode it (as %2c) in URIs.
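
To make the comma cases concrete, here is a rough sketch of one possible splitting rule. It is an assumption, not what the cited draft specifies: a URL is taken to be a maximal run of non-whitespace characters, trailing commas end the candidate, and anything else up to the next comma is treated as descriptors. The class and method names (SrcsetSplitSketch, candidateUrls) are hypothetical and not part of the sanitizer.

```java
import java.util.ArrayList;
import java.util.List;

public final class SrcsetSplitSketch {
  private SrcsetSplitSketch() {}

  private static boolean isSpace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
  }

  /** Returns the URL part of each candidate; descriptors are skipped, not validated. */
  public static List<String> candidateUrls(String srcset) {
    List<String> urls = new ArrayList<>();
    int i = 0;
    int n = srcset.length();
    while (i < n) {
      // Skip whitespace and stray commas between candidates.
      while (i < n && (isSpace(srcset.charAt(i)) || srcset.charAt(i) == ',')) {
        i++;
      }
      if (i >= n) {
        break;
      }
      // Assumed rule: the URL is the maximal run of non-whitespace characters.
      int start = i;
      while (i < n && !isSpace(srcset.charAt(i))) {
        i++;
      }
      int end = i;
      if (srcset.charAt(end - 1) == ',') {
        // Trailing commas terminate the candidate and are not part of the URL.
        while (end > start && srcset.charAt(end - 1) == ',') {
          end--;
        }
      } else {
        // Otherwise skip the descriptors, up to and including the next comma.
        while (i < n && srcset.charAt(i) != ',') {
          i++;
        }
        if (i < n) {
          i++;
        }
      }
      urls.add(srcset.substring(start, end));
    }
    return urls;
  }
}
```

Under this rule, "foo,,bar" is a single URL, "foo, bar" yields foo and bar, and "foo,bar , baz" yields foo,bar and baz. Commas inside a URL survive splitting, which is why re-encoding them as %2c on output is the safer choice.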

Even if commas are properly guarded, a "valid non-empty URL" 
(http://dev.w3.org/html5/spec-LC/urls.html) could contain internal whitespace, 
so we will need to enforce additional restrictions and re-encode all potential 
whitespace inside URLs to guard against URL-splitting attacks like

    http://foo/10w,javascript:alert(1337) 10h

being interpreted as 2 URLs.
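
A minimal sketch of that guard, assuming the sanitizer has already isolated a single candidate URL: check the scheme against an allowlist, then percent-encode commas, spaces, and ASCII control characters so the emitted value cannot be re-split. The class name, the guard method, and the http/https allowlist are hypothetical placeholders, not the library's actual policy API, and non-ASCII whitespace is not handled here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public final class SrcsetUrlGuardSketch {
  // Hypothetical allowlist; a real policy would supply its own protocols.
  private static final Set<String> ALLOWED_SCHEMES =
      new HashSet<>(Arrays.asList("http", "https"));

  private SrcsetUrlGuardSketch() {}

  /** Returns a re-encoded URL that is safe to emit, or null if it is rejected. */
  public static String guard(String url) {
    String trimmed = url.trim();
    if (trimmed.isEmpty()) {
      return null;
    }
    // A ':' that appears before any '/', '?' or '#' delimits a scheme.
    for (int i = 0; i < trimmed.length(); i++) {
      char c = trimmed.charAt(i);
      if (c == ':') {
        String scheme = trimmed.substring(0, i).toLowerCase(Locale.ROOT);
        if (!ALLOWED_SCHEMES.contains(scheme)) {
          return null;
        }
        break;
      }
      if (c == '/' || c == '?' || c == '#') {
        break;  // Relative URL: no scheme to check.
      }
    }
    // Encode anything a srcset parser could treat as a separator.
    StringBuilder sb = new StringBuilder(trimmed.length() + 16);
    for (int i = 0; i < trimmed.length(); i++) {
      char c = trimmed.charAt(i);
      if (c == ',' || c <= 0x20) {
        sb.append(String.format("%%%02x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

With this sketch, the attack string above comes out as the single URL http://foo/10w%2cjavascript:alert(1337)%2010h, while a bare javascript:alert(1337) candidate is rejected outright.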

Original comment by mikesamuel@gmail.com on 21 Jan 2014 at 4:50