iipc / openwayback

The OpenWayback Development
http://www.netpreserve.org/openwayback
Apache License 2.0

archived https URLs aren't replaying in Ubuntu with some Tomcat versions #398

Closed ldko closed 5 years ago

ldko commented 5 years ago

Discussion about this happened on IIPC Slack. For reference, I am putting some of the details in this issue to go along with the PR #397 opened by @peveikko.

On Ubuntu (CentOS and RHEL are not affected), with at least some Tomcat versions, OpenWayback returns "Resource Not in Archive" for archived URIs with the https scheme and suggests searching under http://https/www. The same pages do replay when requested with the http scheme.

@peveikko noted: For https URLs everything works fine on CentOS/RHEL, but I got this behaviour on 3 different Ubuntu machines. I also tried different Tomcat/Java versions.

@anjackson supplied the following: Okay, so I think this is to do with a CVE, https://nvd.nist.gov/vuln/detail/CVE-2015-5174. I think Tomcat have added some URL clean-up/normalisation, meaning that later versions of Tomcat 6/7/8 may all have the same problem. This doesn't affect http URLs, perhaps because this code reinserts any stripped slash? https://github.com/iipc/openwayback/blob/c49f8e7200870c3af40561f3ca340c67c98db02f/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java#L755-L769 ...Easiest thing might be to modify the WaybackRequest to explicitly support /https:/host/... (assuming I've got this right, of course)
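
For illustration, here is a minimal sketch of the kind of fix suggested above, assuming the hypothesis is right that the container collapses `//` to `/` before the archived URL reaches OpenWayback. The class `SchemeSlashFix` and its `restore` method are hypothetical names, not part of OpenWayback, and this is not the code in `WaybackRequest` or in PR #397. It only shows how a collapsed `http:/host/...` or `https:/host/...` could be restored once the archived URL has been extracted from the request path.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenWayback.
final class SchemeSlashFix {
    // Matches a leading "http:/" or "https:/" that is NOT followed by a second slash,
    // i.e. a scheme separator that has been collapsed from "://" to ":/".
    private static final Pattern COLLAPSED_SCHEME =
            Pattern.compile("^(https?):/(?!/)", Pattern.CASE_INSENSITIVE);

    private SchemeSlashFix() {
    }

    // Returns the URL with "://" restored after the scheme, or the input unchanged
    // when the scheme separator is intact or no http/https scheme is present.
    static String restore(String url) {
        if (url == null) {
            return null;
        }
        Matcher m = COLLAPSED_SCHEME.matcher(url);
        if (m.find()) {
            return m.group(1) + "://" + url.substring(m.end());
        }
        return url;
    }
}
```

Handling both schemes in one pattern avoids the asymmetry described in this issue, where only the http case appears to be repaired. URLs that already contain `://`, or that have no scheme at all, pass through unchanged.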