internetarchive / wayback

IA's public Wayback Machine (moved from SourceForge)

Rewrite percent-encoded URLs #94

Open Arkiver2 opened 8 years ago

Arkiver2 commented 8 years ago

Currently, percent-encoded URLs are not rewritten. For example, in the text returned by https://web.archive.org/web/20150804131701/http://blip.tv/file/get/NostalgiaCritic-NCPlanetOfTheApes401.m4v?showplayer=2014093037100220150422135039&referrer=http://blip.tv&mask=11&skin=flashvars&view=url the embedded URL is left unchanged.

Original:

message=http%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2FNostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334

Should be rewritten as:

message=http%3A%2F%2Fweb.archive.org%2Fweb%2F20150804131701%2Fhttp%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2FNostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334 
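
The requested transformation can be sketched roughly as follows. This is a minimal illustration in Python, not code from wayback itself: the function name `rewrite_encoded_urls`, the regular expression, and the `prefix` default are assumptions made for this example. It finds percent-encoded absolute URLs, decodes them, prepends the replay prefix and timestamp, and encodes the result again.

```python
import re
from urllib.parse import quote, unquote

# Matches a percent-encoded absolute URL: "http%3A%2F%2F" or "https%3A%2F%2F"
# followed by characters that may appear in a percent-encoded value.
# (Assumed pattern for illustration; a real rewriter may need a stricter one.)
ENCODED_URL = re.compile(r"https?%3A%2F%2F[A-Za-z0-9._~%-]+", re.IGNORECASE)


def rewrite_encoded_urls(text, timestamp, prefix="http://web.archive.org/web/"):
    """Rewrite every percent-encoded URL in text to point at the archive."""

    def replace(match):
        original = unquote(match.group(0))            # decode the embedded URL
        replay = prefix + timestamp + "/" + original  # build the replay URL
        return quote(replay, safe="")                 # encode everything again

    return ENCODED_URL.sub(replace, text)


original = ("message=http%3A%2F%2Fj41.video2.blip.tv%2F5520014255207%2F"
            "NostalgiaCritic-NCPlanetOfTheApes401.m4v%3Fir%3D96428%26sr%3D2334")
print(rewrite_encoded_urls(original, "20150804131701"))
```

Note that decoding and re-encoding may normalize the escapes of a URL that was encoded differently in the original (for example, lowercase hex digits become uppercase).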

kngenie commented 8 years ago

Hmm, the key question is whether to rewrite any URL look-alike in HTML, and how generally useful that would be. Whether the URL is %-encoded or not is a minor issue here.

If the HTML rewriter rewrote every URL look-alike in HTML, not just URLs in attributes, it would also rewrite any textual mention of a URL in HTML pages. I don't think that's the right thing to do in general. So rewriting %-encoded URLs would be highly specific to this case. Unfortunately, wayback does not have a mechanism for applying rewrite rules specific to particular URLs at the moment.
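
To make the idea of URL-specific rewrite rules concrete, here is one possible shape for such a mechanism, sketched in Python. Nothing here exists in wayback: `SiteRule`, `RULES`, `apply_site_rules`, `rewrite_percent_encoded`, and the blip.tv pattern are all hypothetical names chosen for illustration. The rule is applied only when the URL of the capture being replayed matches, so textual URL mentions on other sites stay untouched.

```python
import re
from dataclasses import dataclass
from typing import Callable
from urllib.parse import quote, unquote

REPLAY_PREFIX = "http://web.archive.org/web/"  # assumed replay prefix


def rewrite_percent_encoded(body: str, timestamp: str) -> str:
    """Hypothetical site-specific rewriter for percent-encoded URLs."""

    def replace(match):
        replay = REPLAY_PREFIX + timestamp + "/" + unquote(match.group(0))
        return quote(replay, safe="")

    return re.sub(r"https?%3A%2F%2F[A-Za-z0-9._~%-]+", replace, body,
                  flags=re.IGNORECASE)


@dataclass
class SiteRule:
    capture_url_pattern: re.Pattern      # which captured URLs the rule covers
    rewrite: Callable[[str, str], str]   # (body, timestamp) -> rewritten body


# Hypothetical rule table: only blip.tv "file/get" responses get the
# percent-encoded rewrite.
RULES = [
    SiteRule(re.compile(r"^https?://blip\.tv/file/get/"), rewrite_percent_encoded),
]


def apply_site_rules(capture_url, body, timestamp, rules=RULES):
    """Run every rule whose pattern matches the URL of the replayed capture."""
    for rule in rules:
        if rule.capture_url_pattern.search(capture_url):
            body = rule.rewrite(body, timestamp)
    return body
```

With this layout, a response captured from `http://blip.tv/file/get/...` would have its percent-encoded URLs rewritten, while the same text captured from any other site would pass through unchanged.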

Arkiver2 commented 8 years ago

With the next generation of the Wayback Machine (https://blog.archive.org/2015/10/21/grant-to-develop-the-next-generation-wayback-machine/), will it be possible to add special URL rewrite rules for certain URLs?

EDIT: Same question for ignoring or removing certain query-string parameters via special rules, for example timestamps and forum session IDs.
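
As a rough illustration of the query-string case, the sketch below drops a configurable set of parameters from a URL. It is written in Python with standard-library calls only; the function name `strip_ignored_params` and the parameter names in `IGNORED_PARAMS` are assumptions made for this example and do not describe how wayback behaves.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters to ignore, compared case-insensitively.
IGNORED_PARAMS = {"sid", "phpsessid", "timestamp"}


def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return url with every query parameter named in ignored removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    # urlencode re-escapes the remaining values, so the result is a
    # normalized form and may not be byte-identical to the input.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_ignored_params("http://forum.example/viewtopic.php?t=42&sid=abc123"))
```

Two forum URLs that differ only in their session ID would then map to the same stripped URL.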