overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Add option to limit recursion of external outlinks to 1-level deep #17

Closed (devnoname120 closed this issue 1 year ago)

devnoname120 commented 1 year ago

It would be convenient to have an option that saves only the first level of external outlinks, while placing no depth restriction on internal outlinks.

The internal/external separation could be set with a new option that expects a RegExp (just like -o and -x).

I can't figure out an easy way to do that so far.

The only solution I have in mind (but didn't try) is:

overcast07 commented 1 year ago

You could try combining -a accesskey:secret -d capture_outlinks=1 with -o https?://mywebsite\.com (this requires archive.org S3 keys). The script's own options can't limit traversal to a single level of outlinks, whereas the API's capture_outlinks option only ever traverses a single level, so in theory combining the two would give you what you're describing.
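As a rough, untested sketch, the combination could be wrapped like this. It assumes spn.sh is on your PATH. The key pair, the mywebsite.com pattern, and the save_site helper name are placeholders for illustration only, not part of the script.

```bash
#!/usr/bin/env bash
# Sketch only: every value below is a placeholder, not a real credential or site.
s3_keys='accesskey:secret'                  # archive.org S3 keys, in the form accesskey:secret
internal_pattern='https?://mywebsite\.com'  # regex matching the internal outlinks to keep following
api_data='capture_outlinks=1'               # API option: also capture each page's outlinks (one level)

# Hypothetical helper: wraps the invocation so nothing runs until it is called.
# Assumes spn.sh is on PATH; the starting URLs are passed as arguments.
save_site() {
  spn.sh -a "$s3_keys" -d "$api_data" -o "$internal_pattern" "$@"
}

# Preview the command without contacting the Wayback Machine.
echo "spn.sh -a <keys> -d $api_data -o '$internal_pattern' <url>..."
```

Once the preview looks right, something like `save_site https://mywebsite.com/` would start the job. The single quotes around the pattern keep the shell from interpreting the `?` and the backslash.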

Note that since proper support hasn't been implemented in the script for -d capture_outlinks=1, you won't be able to get any data about the external outlinks, and you will probably end up capturing the internal outlinks more times than necessary.