iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

Escape redirect URLs in RealCDXExtractorOutput #36

Closed gerhardgossen closed 9 years ago

gerhardgossen commented 9 years ago

The class does not escape the URLs it gets from the HTTP headers or the HTML meta tags. This makes the resulting CDX files invalid if the redirect URL contains spaces (see e.g. https://github.com/internetarchive/ia-hadoop-tools/issues/4). This commit fixes that by passing the resolved URL through one of java.net.URI's multi-argument constructors, which escape the individual parts appropriately.
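To illustrate the idea, here is a minimal sketch of escaping via the five-argument `java.net.URI` constructor. It is not the code from this commit: the class name `RedirectUrlEscaper` and the method `escape` are hypothetical, and the URL is assumed to be already split into scheme, authority, path and query.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of webarchive-commons.
final class RedirectUrlEscaper {

    // Builds a URI from its separate parts. The multi-argument constructor
    // quotes characters that are illegal in each component, so a space in
    // the path or query becomes %20.
    static String escape(String scheme, String authority, String path, String query) {
        try {
            URI uri = new URI(scheme, authority, path, query, null);
            // toASCIIString() additionally percent-encodes non-ASCII characters.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            // Assumption: callers treat an unparseable redirect target as a bad argument.
            throw new IllegalArgumentException("Cannot escape redirect URL", e);
        }
    }

    public static void main(String[] args) {
        // prints "http://example.com/some%20page.html?q=a%20b"
        System.out.println(escape("http", "example.com", "/some page.html", "q=a b"));
    }
}
```

One caveat with this approach: these constructors always quote the percent character, so a redirect URL that is already escaped (for example a path containing `%20`) would come out double-escaped (`%2520`).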

anjackson commented 9 years ago

This looks good. Can you also add a note to the CHANGES.md file that summarises the change?

gerhardgossen commented 9 years ago

Updated CHANGES.md

anjackson commented 9 years ago

Thanks, looks great.