iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0
49 stars 72 forks source link

Make regular expression to extract URLs from CSS more restrictive #63

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

(allow only ", ', \" or \' in front of or after the URL). Avoid long-runners when matching the regex due to heavy back-tracking.

See commoncrawl/ia-web-commons#2 for more details.

MohammedElsayyed commented 7 years ago

I added this test case

"url(''i261.photobucket.com'')","i261.photobucket.com"

inside ExtractingParseObserverTest.testHandleStyleNode, but it didn't work as expected:

expected:<i261.photobucket.com> but was:<'i261.photobucket.com'>
junit.framework.ComparisonFailure
    at junit.framework.Assert.assertEquals(Assert.java:81)
    at junit.framework.Assert.assertEquals(Assert.java:87)
    at org.archive.resource.html.ExtractingParseObserverTest.checkExtract(ExtractingParseObserverTest.java:91)
    at org.archive.resource.html.ExtractingParseObserverTest.testHandleStyleNode(ExtractingParseObserverTest.java:52)

Do you think you could double-check it and let me know your findings? I might be mistaken.

sebastian-nagel commented 7 years ago

Hi @MohammedElsayyed, good point! The CSS spec does only allow one quotation mark (single or double). However, the parser should be able to extract also multiply quoted CSS links. That there are still quotes around the URL is not a problem of the extraction pattern but of the cleansing code which removes only one pair of quotes. I'll try to improve the code. Thanks!

sebastian-nagel commented 7 years ago

Hi @MohammedElsayyed, your test case "url(''i261.photobucket.com'')","i261.photobucket.com" should now be covered, I've added a similar one. The pr includes #57 which also affects the stripping of quote characters.

MohammedElsayyed commented 7 years ago

It worked perfectly normal. But I am not sure about the following test case:

"url('foo.gif'')","foo.gif"

If it could happen, then it should be handled. I included it because I noticed that # of apostrophes before and after URL is not the same as stated in the main issue commoncrawl/ia-web-commons#2

body {
     background-image: url('''... (524288 apostrophes in total!) ...'''http://i261.photobucket.com/albums/ii67/iglup/l10/1280/foto_74967ac2.jpg'''... (515996 apostrophes in total!) ...'''
sebastian-nagel commented 7 years ago

Stripping of quotes was done pairwise (one leading, one trailing quote) before. It would be even easier to strip leading and trailing quotes independent whether they are paired. I'll update the pull request. Thanks!

sebastian-nagel commented 7 years ago

Also unpaired quotation marks are now stripped. Thanks!

MohammedElsayyed commented 7 years ago

Thank you, Mr. @sebastian-nagel. It worked as expected. Well done!