overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Handling domains excluded from the Wayback Machine #13

Closed · ghost closed this issue 1 year ago

ghost commented 1 year ago

The website reports:

Sorry.

This URL has been excluded from the Wayback Machine.

It appears that the script currently reports excluded websites as successful submissions. For instance:

```
spn.sh https://www.nohomers.net/
```

outputs the following (timestamps and the data folder path stripped):

```
Created data folder ~/spn-data/
[Job submitted] https://www.nohomers.net/
[Job completed] https://www.nohomers.net/
```

Meanwhile, archive.org's internal "block list" does report an error for these URLs (the list evidently contains at least the domain above):

This URL is in the Save Page Now service block list and cannot be captured. Please email us at "info@archive.org" if you would like to discuss this more.
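
If the script inspected the job status response from the SPN2 API, it could presumably surface this failure. A minimal sketch (assuming the status endpoint reports blocked URLs with a status_ext of error:blocked-url, as described in the SPN2 API documentation; the exact response formatting here is an unverified assumption):

```bash
# Hypothetical post-submission check; assumes $job_id and $url are
# already set. Looks for the documented "error:blocked-url" code in
# the SPN2 job status response (exact JSON formatting unverified).
response=$(curl -s "https://web.archive.org/save/status/${job_id}")
if grep -q 'error:blocked-url' <<< "$response"; then
    echo "[Job failed] $url (URL is in the SPN block list)" >&2
fi
```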

brandongalbraith commented 1 year ago

In these situations, consider a mechanism or option to send an archival request to another service that does not exclude these sites, such as https://archive.ph/.

overcast07 commented 1 year ago

Information about the block list isn't provided by the Save Page Now service. There are some sites where archiving appears to work fine in SPN and no error messages are shown, but the domain itself is actually blocked and you can't view the captures. For example, if you archive a Dropbox download link ending in ?dl=1, SPN won't report any problems, and the capture of the URL that the submitted URL redirects to will be viewable; however, certain parts of the main dropbox.com domain are blocked, so the capture of the submitted URL itself won't be viewable. We could infer from this that the captures are actually kept in spite of not being visible, but I don't know for sure.
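
As a rough illustration, the Wayback availability API can show whether any capture of a URL is publicly viewable; an empty archived_snapshots object can mean either that no captures exist or that the URL is excluded, so this is only a heuristic:

```bash
# Ask the availability API about the URL from this issue. Blocked URLs
# appear to return an empty "archived_snapshots" object even if captures
# exist internally (observed behavior, not a documented guarantee).
curl -s "https://archive.org/wayback/available?url=https://www.nohomers.net/"
# Typical response for a blocked (or never-captured) URL:
# {"url": "https://www.nohomers.net/", "archived_snapshots": {}}
```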

There is no archive.ph API, and the site maintainer (it's literally just one person running it) almost certainly doesn't plan to add one. Automated submissions are actively discouraged on that site, and you can get a CAPTCHA if you submit more than a few links. (That site also has a block list, and I don't think there are any large archival sites that don't have one.)

Maybe you could check whether the URL is blocked using the CDX API before every capture, but it's not necessarily something everyone would want to enable. If the captures are actually being stored, some people might consider that a successful archival of the content even though the captures aren't visible.
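
A rough sketch of what such an opt-in pre-check could look like (this assumes the CDX endpoint responds with HTTP 403 for excluded URLs, which matches observed behavior but isn't a documented guarantee):

```bash
#!/bin/bash
# Opt-in pre-capture check: query the CDX API and skip submission if
# the response looks like an access-control block. Assumes excluded
# URLs get an HTTP 403 from the CDX endpoint (observed, not documented).
url="$1"
http_code=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://web.archive.org/cdx/search/cdx?url=${url}&limit=1")
if [[ "$http_code" == "403" ]]; then
    echo "[Excluded] $url" >&2
    exit 1
fi
echo "[Not excluded] $url"
```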