ryanfb / iiif-dl

Command-line tile downloader/assembler for IIIF endpoints/manifests
MIT License

Any plans for respecting institutions that don't want their stuff harvested in this way? #4

Closed mattmcgrattan closed 8 years ago

mattmcgrattan commented 8 years ago

Perhaps via robots.txt? Or a similar mechanism?

ryanfb commented 8 years ago

I'll look into adding this, though it will add a certain amount of complexity (right now I'm getting away without using any gems and thus Bundler). It looks like robotstxt-parser (https://github.com/gjtorikian/robotstxt-parser) might do most of what would be needed, outside of maybe Crawl-delay (Googling now, it seems like robotex (https://github.com/chriskite/robotex) provides this). I'd probably want to match against iiif-dl as the agent name (making sure all the requests also send that to the server might be another issue I need to look into).
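As a rough illustration of the check described above, here is a minimal sketch in Python (not the Ruby that iiif-dl itself is written in) using only the standard library's `urllib.robotparser`. The agent name `iiif-dl` comes from the comment above; the function names, the example rules, and the `example.org` URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

AGENT = "iiif-dl"  # agent name proposed in the thread


def load_rules(robots_lines):
    """Build a parser from the lines of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def check(parser, url, agent=AGENT):
    """Return (allowed, crawl_delay) for this agent and URL.

    crawl_delay is None when robots.txt sets no Crawl-delay for the agent.
    """
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = [
    "User-agent: iiif-dl",
    "Crawl-delay: 5",
    "Disallow: /iiif/private/",
    "",
    "User-agent: *",
    "Disallow:",
]

if __name__ == "__main__":
    rules = load_rules(EXAMPLE_ROBOTS)
    print(check(rules, "https://example.org/iiif/private/img/info.json"))
    print(check(rules, "https://example.org/iiif/public/img/info.json"))
```

In a real downloader the lines would come from fetching `/robots.txt` on the image server's host before any tile request is made.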

In the meantime, I would suggest that in many cases what such an institution may want is something that should eventually be provided by the IIIF authentication API: http://iiif.io/api/auth/0.9/

That is, if you're providing full-resolution images to unauthenticated users, but expecting them not to be able to download them, you might want to think hard about what you're really trying to do…

azaroth42 commented 8 years ago

@ryanfb:

> That is, if you're providing full-resolution images to unauthenticated users, but expecting them not to be able to download them, you might want to think hard about what you're really trying to do…

Except you're scraping tiles, and not requesting the full region, full size image. Why are you doing that, if not to work around institutions' desire to not provide the full image for download?

You're also not respecting the tile sizes from the server's info.json for the image, and thus likely filling up caches with unwanted images. There's a lot that the script could do to be more polite :)
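To make the tile-size point concrete, here is a minimal Python sketch of building tile requests that line up with the grid a server advertises. It assumes an IIIF Image API 2.x style `info.json` containing `@id`, `width`, `height`, and a `tiles` entry with `width` and `scaleFactors`; the function name and the example values are hypothetical, and this is not how iiif-dl is implemented.

```python
import math


def tile_urls(info, scale_factor=1):
    """Yield tile URLs aligned to the tile grid advertised in info.json.

    `info` is the parsed info.json as a dict (Image API 2.x layout assumed).
    """
    base = info["@id"]
    width, height = info["width"], info["height"]
    tile = info["tiles"][0]
    tile_w = tile["width"]
    tile_h = tile.get("height", tile_w)  # height defaults to width
    if scale_factor not in tile["scaleFactors"]:
        raise ValueError("scale factor not advertised by the server")

    # Each tile covers tile size * scale factor pixels of the full image.
    step_x, step_y = tile_w * scale_factor, tile_h * scale_factor
    for y in range(0, height, step_y):
        for x in range(0, width, step_x):
            w = min(step_x, width - x)   # clip at the right edge
            h = min(step_y, height - y)  # clip at the bottom edge
            out_w = math.ceil(w / scale_factor)
            yield f"{base}/{x},{y},{w},{h}/{out_w},/0/default.jpg"


if __name__ == "__main__":
    # Hypothetical image description, for illustration only.
    example = {
        "@id": "https://example.org/iiif/img",
        "width": 1000,
        "height": 600,
        "tiles": [{"width": 512, "scaleFactors": [1, 2]}],
    }
    for url in tile_urls(example):
        print(url)
```

Requesting only these regions means the downloader asks for the same tiles a viewer would, so the server can answer from its existing cache instead of rendering and caching new, arbitrary crops.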

mattmcgrattan commented 8 years ago

We don't provide full access. If you do /full/full/ you'll get a 1000 pixel image. You don't get the full thing.

Your code is using tile requests to get around that.

And auth won't stop that, as we'd have to auth every tile request, not just requests to download the full image, since you are building images from tile requests.

Respecting robots.txt would be good practice, as would using a consistent agent name, because that could be filtered or treated differently by service providers.
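A consistent agent name could be sent along these lines. This is a minimal Python sketch using the standard library's `urllib.request`; the agent string `iiif-dl` is the name floated earlier in the thread, while the helper names and the URL are hypothetical.

```python
from urllib.request import Request, urlopen

USER_AGENT = "iiif-dl"  # one fixed name, sent on every request


def build_request(url):
    """Create a request that always identifies itself with USER_AGENT."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=30):
    """Download a resource (info.json, tile, manifest) with the agent header set."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    req = build_request("https://example.org/iiif/img/info.json")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Routing every request through one helper like `build_request` is what makes the name consistent, which in turn lets a service provider filter or rate-limit it.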

Matt



ryanfb commented 8 years ago

> Except you're scraping tiles, and not requesting the full region, full size image. Why are you doing that, if not to work around institutions' desire to not provide the full image for download?

Trying to keep institutions honest. Stitching tiles is trivial (as demonstrated by this script). I was actually inspired to write this script by this thread on IIIF Discuss.

Frankly, I'm tired of breathless press releases about providing high-resolution images to the public, with no download link (now with added breathlessness about IIIF standards). The IIIF standard lets anyone request the tiles, unauthenticated, at a given resolution, and assemble them, so I wrote something to do that and save me (and anyone else) the hassle of going to a page and max-zooming then assembling the image(s) from my browser's cache/requests (though looking through IIIF Discuss for discussion about this topic has alerted me to the existence of dezoomify, a tool which does just that).

If you want to prevent unauthenticated users from downloading high-resolution images…you probably shouldn't serve unauthenticated users high-resolution images. This script using robots.txt and an agent name won't change that.

> Except you're scraping tiles, and not requesting the full region, full size image. Why are you doing that, if not to work around institutions' desire to not provide the full image for download?

There are cases where an image server might not want to return a full-size image for /full/full/ for technical reasons but is perfectly fine with image region requests, e.g. a 10-gigapixel astronomical image.

> You're also not respecting the tile sizes from the server's info.json for the image, and thus likely filling up caches with unwanted images. There's a lot that the script could do to be more polite :)

Again, an issue of added complexity when I very quickly put this script together in whatever way let me get something that worked with the least hassle. I've made a separate issue for this in #5.

I'll note that I'm happy to receive Pull Requests.

> And auth won't stop that, as we'd have to auth every tile request, not just requests to download the full image, since you are building images from tile requests.

Yes, you would probably want to auth every tile request. I believe this is covered under "Content Resources" in the Authentication API proposal. However, if you only provide pointers to degraded versions of Content Resources to unauthenticated users accessing your Description Resources, you'd probably cover about 99% of realistic cases. If the Authentication API proposal doesn't meet your needs, I'd really suggest giving your feedback to them now, because as far as I know this is exactly the kind of scenario the proposal is supposed to address.

ryanfb commented 8 years ago

For what it's worth, since I've had multiple people express backchannel concern over this, I'd like to try to offer some clarification:

  1. I do plan on implementing robots.txt support, ideally with Crawl-delay and an agent name. However, writing, testing, and documenting this code will take some amount of time. Same for the info.json issue in #5.
  2. I'm not personally using this tool to do large-scale manifest/image crawls. I've only ever tested this tool on one-off manifests to verify that it works for a handful of servers/images. If you're seeing abuse on your IIIF image servers from wget, it's not from me.
  3. As a result of 2, my adding code to support robots.txt won't prevent anyone else from disabling it or writing their own tool.
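For the Crawl-delay part of point 1, the pacing logic could look roughly like the Python sketch below. The `Throttle` class and its parameters are hypothetical; the delay value would come from whatever the server's robots.txt specifies, and none of this is taken from iiif-dl itself.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive requests."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock   # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(delay=1)
    for tile_number in range(3):
        throttle.wait()  # call before each tile request
        print("would fetch tile", tile_number)
```

Calling `wait()` before every request keeps the spacing correct even when individual downloads take varying amounts of time.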

I didn't realize this was so controversial, given that I'm mostly echoing what I've seen for this topic before on IIIF Discuss.

@azaroth42: do you plan on adding the same functionality to azaroth42/iiif-harvester?

azaroth42 commented 8 years ago

Done. Tag, you're it :)

ryanfb commented 8 years ago

Thanks, that gave me something to crib from.