Download HTML with archive.org URLs for assets

jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.

MIT License


Download HTML with archive.org URLs for assets #37

Closed: alexgarciab closed this issue 4 years ago

alexgarciab commented 4 years ago

Would it be possible to add an argument that downloads the HTML of an archived page from archive.org with full https://web.archive.org URLs for its assets?

For example, right now, an image is being referenced in the HTML as:

<img src="/assets/image-sprites.png">

Since only the .html file is downloaded, that image is broken.

What I am asking is to be able to download the HTML with full URL paths such as:

https://web.archive.org/web/20130909175810im_/https://domain.com/assets/image-sprites.png

This would apply to all external resources: images, CSS, JS, etc.

That way, I would be able to see the page exactly as I see it on the Wayback Machine.

jsvine commented 4 years ago

Thanks for your interest in this library. If the Wayback Machine's CDX API supports returning HTML with those full URL paths, then it should be relatively easy to add to the library. Do you know if it does? If it doesn't, then it'd likely require a bit more thinking / labor.

alexgarciab commented 4 years ago

I have been reading about the CDX API. Apparently it only returns resource URLs, so it would be a matter of replacing the relative-path URLs in the downloaded .html file with the URLs returned by the CDX API. I would guess they have some way to connect these two APIs, so that you can download HTML pages with full resource URIs.
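To make the CDX part concrete, here is a rough sketch of how a lookup could work. It assumes the public CDX endpoint at `https://web.archive.org/cdx/search/cdx` with `output=json`, where the first row of the response names the fields. The helper names (`build_cdx_query`, `parse_cdx_rows`, `snapshot_url`) and the sample payload are made up for illustration and are not part of waybackpack.

```python
import json
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=5):
    """Build a CDX query URL asking for JSON output (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload):
    """Turn a CDX JSON response into dicts; the first row holds field names."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record, modifier=""):
    """Compose the archived URL for one CDX record (e.g. modifier='im_')."""
    return (
        "https://web.archive.org/web/"
        f"{record['timestamp']}{modifier}/{record['original']}"
    )


# Made-up sample response, shaped like the CDX JSON output.
SAMPLE_PAYLOAD = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype",
     "statuscode", "digest", "length"],
    ["com,domain)/assets/image-sprites.png", "20130909175810",
     "https://domain.com/assets/image-sprites.png", "image/png",
     "200", "EXAMPLEDIGEST", "1234"],
])

if __name__ == "__main__":
    print(build_cdx_query("domain.com/assets/image-sprites.png"))
    for record in parse_cdx_rows(SAMPLE_PAYLOAD):
        print(snapshot_url(record, modifier="im_"))
```

In real use the payload would come from fetching the query URL (for example with `urllib.request.urlopen`) rather than from the sample string.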

jsvine commented 4 years ago

Yes, that does seem to be the case. Thank you for looking into it. Rather than waybackpack making assumptions about how, exactly, a user would want those URIs transformed, I think such post-processing might work best outside of the library. Closing this issue for now, but feel free to reopen with alternative suggestions.
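For anyone landing here later, the post-processing could look roughly like the sketch below. It is not part of waybackpack. It assumes you already know the snapshot timestamp and the original page URL, and it rewrites `src`/`href` on `<img>`, `<script>`, and `<link>` tags with a regular expression. The function name `absolutize_assets` and the tag-to-modifier mapping (`im_`, `js_`, `cs_`) are assumptions made for this example. A real HTML parser would be more robust than a regex.

```python
import re
from urllib.parse import urljoin

WAYBACK_PREFIX = "https://web.archive.org/web"

# Assumed mapping from tag name to Wayback URL modifier (simplification:
# every <link> is treated as a stylesheet).
MODIFIERS = {"img": "im_", "script": "js_", "link": "cs_"}

# Matches the src/href attribute of <img>, <script>, and <link> tags.
ASSET_ATTR = re.compile(
    r'(<(img|script|link)\b[^>]*?\s(?:src|href)=)(["\'])(.*?)\3',
    re.IGNORECASE | re.DOTALL,
)


def absolutize_assets(html, timestamp, page_url):
    """Rewrite asset URLs in `html` to full web.archive.org URLs.

    `timestamp` is the 14-digit snapshot timestamp and `page_url` is the
    original (non-archived) URL of the page, used to resolve relative paths.
    """
    def replace(match):
        prefix, tag, quote, value = match.groups()
        # Leave inline data and already-archived URLs untouched.
        if value.startswith("data:") or "web.archive.org" in value:
            return match.group(0)
        original = urljoin(page_url, value)
        modifier = MODIFIERS[tag.lower()]
        archived = f"{WAYBACK_PREFIX}/{timestamp}{modifier}/{original}"
        return f"{prefix}{quote}{archived}{quote}"

    return ASSET_ATTR.sub(replace, html)


if __name__ == "__main__":
    page = '<img src="/assets/image-sprites.png">'
    print(absolutize_assets(page, "20130909175810", "https://domain.com/"))
```

Run on the example from the original request, this turns the `/assets/image-sprites.png` reference into the full `https://web.archive.org/web/20130909175810im_/https://domain.com/assets/image-sprites.png` URL. Keeping this step in a separate script lets each user decide which tags and modifiers to rewrite.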