iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

Prevent stack overflow by limiting length of matched pattern #87

Open sebastian-nagel opened 5 years ago

sebastian-nagel commented 5 years ago

The pattern used to match CSS-embedded URLs places no limit on the match length, i.e. it matches URLs of any length, which can cause a Java stack overflow (see commoncrawl/ia-web-commons#12).

This PR fixes the issue and adds a unit test that reproduces the problem and verifies the fix.
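
For readers without the diff at hand, the idea of capping the match length can be sketched as follows. This is a minimal illustration, not the pattern from this PR: the class name `BoundedCssUrlSketch`, the helper `firstUrl`, and the cap `MAX_URL_LENGTH = 2048` are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedCssUrlSketch {
    // Hypothetical cap on the URL length; the PR may use a different value.
    static final int MAX_URL_LENGTH = 2048;

    // Illustrative pattern only (not the one in webarchive-commons).
    // The bounded quantifier {1,N} replaces an unbounded + or *.
    static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(['\"]?)([^'\"()\\s]{1," + MAX_URL_LENGTH + "})\\1\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns the first CSS-embedded URL, or null if none matches.
    static String firstUrl(CharSequence css) {
        Matcher m = CSS_URL.matcher(css);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("body { background: url(img/bg.png); }"));

        // An overlong URL is simply not matched instead of being matched in full.
        StringBuilder longCss = new StringBuilder("url(");
        for (int i = 0; i < 5000; i++) {
            longCss.append('a');
        }
        longCss.append(')');
        System.out.println(firstUrl(longCss));
    }
}
```

The trade-off in this sketch is that URLs longer than the cap are silently skipped rather than extracted, so the cap should be chosen generously.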

ato commented 5 years ago

Looks like this patch also disallows whitespace within the URL? Under the old pattern, `url('foo bar')` matched, but under the new pattern it does not. According to [MDN's documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/url()), whitespace should be allowed if the URL is quoted:

> Quotes are required if the URL includes parentheses, whitespace, or quotes, unless these characters are escaped, or if the address includes control characters above 0x7e.
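
One way to keep the length limit while still accepting whitespace inside quoted URLs is to treat the quoted and unquoted forms as separate alternatives. The following is only a sketch under assumptions: the class `QuotedCssUrlSketch`, the helper `firstUrl`, and the cap `MAX` are hypothetical, and backslash escapes are ignored for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedCssUrlSketch {
    // Hypothetical length cap, chosen for illustration only.
    static final int MAX = 2048;

    // Illustrative pattern, not the one proposed in this PR.
    static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(?:"
        + "\"([^\"]{1," + MAX + "})\""     // double-quoted: whitespace and parentheses allowed
        + "|'([^']{1," + MAX + "})'"       // single-quoted: whitespace and parentheses allowed
        + "|([^'\"()\\s]{1," + MAX + "})"  // unquoted: no whitespace, quotes, or parentheses
        + ")\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    // Returns the first CSS-embedded URL, or null if none matches.
    static String firstUrl(CharSequence css) {
        Matcher m = CSS_URL.matcher(css);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three capture groups participates in a match.
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("a { background: url('foo bar') }")); // quoted, with whitespace
        System.out.println(firstUrl("a { background: url(foo bar) }"));   // unquoted whitespace is rejected
    }
}
```

With this shape, `url('foo bar')` yields `foo bar`, while the unquoted `url(foo bar)` yields no match, which is in line with the MDN passage quoted above.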