iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

urls with spaces unescaped #58

Open ghost opened 8 years ago

ghost commented 8 years ago

With a badly configured redirect it's possible to arrive at a URL with unescaped spaces in the name, e.g.:

uk,nhs,wales)/sites3/docopen.cfm?637545f2-1143-e756-5c8403609089cb40&id=18400&orgid=268 20090729144455 http://www.wales.nhs.uk/sites3/docopen.cfm?orgid=268&ID=18400&637545F2-1143-E756-5C8403609089CB40 text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ C:\Documents and Settings\All Users\Documents\My Pictures\Sample Pictures\CLHG building.JPG - 349 21033214 EA-TNA0709.www.nhs.uk-20090729135204-01475.arc.gz

Clearly this has gone completely wrong and the underlying record is unusable, but the fault in this record also prevents parsing of the CDX file, because the spaces in the redirect field split it into extra space-delimited fields. In this situation it might be better for the CDX generation code to check for and escape spaces in the redirect URL, while emitting a warning that the record is broken.
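
To illustrate the idea, here is a rough sketch in Python of escaping whitespace in a field before it is written to a CDX line. It is not the webarchive-commons API: the function name `escape_cdx_field` and the logger name are made up for this example, and percent-encoding is just one assumed way of escaping.

```python
import logging
import re

log = logging.getLogger("cdx-sketch")  # hypothetical logger name

WHITESPACE = re.compile(r"\s")


def escape_cdx_field(value: str, field_name: str = "redirect") -> str:
    """Percent-encode whitespace so the value stays a single CDX field.

    Hypothetical helper for illustration only; not part of webarchive-commons.
    """
    if not WHITESPACE.search(value):
        return value  # nothing to escape, leave the field untouched

    # Warn that the record is broken, as suggested in the issue.
    log.warning("broken record: %s field contains whitespace: %r", field_name, value)

    # Replace each whitespace character with its UTF-8 bytes as %XX.
    def encode(match: re.Match) -> str:
        return "".join("%%%02X" % b for b in match.group().encode("utf-8"))

    return WHITESPACE.sub(encode, value)


# The redirect value from the record above.
redirect = (
    r"C:\Documents and Settings\All Users\Documents"
    r"\My Pictures\Sample Pictures\CLHG building.JPG"
)
escaped = escape_cdx_field(redirect)
```

With the spaces encoded, the redirect value is a single token again, so a reader splitting the line on spaces sees the expected number of fields even though the record itself is still broken.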

ghost commented 8 years ago

Just realised that this is a particular case of issue #37.