overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Add option to only capture direct outlinks #16

Closed · manu-cyber closed this issue 1 year ago

manu-cyber commented 1 year ago

It would be super cool if the script had a flag to only capture "first level" outlinks instead of following them recursively.

For example I use this script to archive the articles and blog posts I read. I want any direct references from the post to also be captured, but not to follow them any further where the script might just end up trying to capture half the internet.

The script is a little too complex for me to do this ad hoc, and I’m also not entirely sure whether this would be an additional flag (like -o1 or something) or a flag that modifies -o (e.g. --depth or --levels).

I’d love your input on this, and maybe you have some pointers for me so I can have a go at this when I have a little more time. If you think the option would be useful and are quicker to implement it yourself I wouldn’t be mad either ;)

Thank you for sharing this script publicly either way :)

overcast07 commented 1 year ago

At the moment the script isn't really structured in a way that would allow that to be implemented trivially, but if you have an Internet Archive account you can use -a accesskey:secretkey -d capture_outlinks=1 (accesskey:secretkey being the S3 API keys of the account) to use the "Save outlinks" option that's present on the website. This will cause all first-level outlinks to be captured. However, the script currently won't log any of the data from the outlink captures.
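For reference, an invocation along those lines might look like the sketch below. This assumes the repo's spn.sh is the entry point and that it takes the page URL as an argument; the S3 keys can be generated at https://archive.org/account/s3.php.

```bash
# Capture a page and let Save Page Now itself capture all first-level
# outlinks. Requires an Internet Archive account; substitute your own
# S3 access key and secret key.
./spn.sh -a accesskey:secretkey -d capture_outlinks=1 https://example.com/blog-post
```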

manu-cyber commented 1 year ago

Oh sweet, that works perfectly for my use case, thanks for the hint. In that case I wouldn’t really need an explicit flag for that, so for my part you can close this issue if you like.

Thanks for your prompt response.

TheTechRobo commented 1 year ago

Note that according to SPN's public API docs (I'd share a link, but I'm at school), up to 1,000 outlinks are normally returned, but the maximum number of outlinks captured if you set capture_outlinks is 250 (if I remember correctly). So if a webpage has several hundred outlinks, this option might miss some.
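For context, the script's -d flag appears to pass extra POST data straight through to the Save Page Now 2 API, so the equivalent direct request would look roughly like the curl sketch below. The endpoint, headers, and parameters are taken from the public API docs linked later in this thread; the target URL is a placeholder.

```bash
# Minimal sketch of the underlying Save Page Now 2 API call.
# ACCESSKEY:SECRETKEY are the account's S3 keys.
curl -s -X POST "https://web.archive.org/save" \
  -H "Accept: application/json" \
  -H "Authorization: LOW ACCESSKEY:SECRETKEY" \
  -d "url=https://example.com/blog-post" \
  -d "capture_outlinks=1"
```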

overcast07 commented 1 year ago

I can't remember it ever capturing that many outlinks when using the option on the website, although I have occasionally noticed in the past that the maximum number of outlinks it captures has changed (e.g. from 100 to 60). I can't be more specific because I haven't thoroughly checked what the actual behavior currently is.

TheTechRobo commented 1 year ago

Yeah, I don't know exactly what the limit is in practice; it never seemed to get that high. But I think 250 is what's shown in the API docs.

manu-cyber commented 1 year ago

If the link [0] in the README is correct, it should be a maximum of 100 outlinks. That’s more than enough for a simple blog post or news article :)

From the "Limitations" section:

Max number of outlinks captured using capture_outlinks option = 100

[0] https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

TheTechRobo commented 1 year ago

Ah yes, those are the docs I was thinking of. I thought it was 250.

overcast07 commented 1 year ago

Because of the way the script is structured, it wouldn't be possible to implement this without either (a) splitting up the collection of outlinks and of URLs for failed captures, which are currently combined and handled together, or (b) not collecting outlinks when retrying URLs. The described behavior is already possible using -d capture_outlinks=1, so I don't think it would be worth implementing this.
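To make the structural constraint concrete, here is a rough, purely hypothetical sketch of option (a); the variable and function names are illustrative and not taken from the actual script.

```bash
#!/bin/bash
# Hypothetical sketch of option (a): keep first-level outlinks separate
# from failed-capture retries, so outlinks are captured once but never
# expanded further. Not the script's actual structure.

capture() {  # stub standing in for a Save Page Now submission
    echo "capturing $1 (collect outlinks: $2)"
}

retry_list=("https://example.com/failed")     # failed captures: retried, outlinks collected
outlinks_list=("https://example.com/linked")  # first-level outlinks: captured, not expanded

for url in "${retry_list[@]}"; do capture "$url" yes; done
for url in "${outlinks_list[@]}"; do capture "$url" no; done
```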