uoregon-libraries / rais-image-server

RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming
Creative Commons Zero v1.0 Universal

S3 identifier prefix #20

Closed acdha closed 4 years ago

acdha commented 5 years ago

I'm not sure whether this is an update for the documentation or a feature request, but I wanted to set up an instance of RAIS which only serves images from S3 and ended up having to read the source to learn that I have to prefix the IIIF image identifier with s3:. It's not the end of the world, but it feels like an implementation detail I'd prefer not to leak into public URLs.

jechols commented 5 years ago

Okay, this is poor indeed. The documentation needs to be updated at a minimum.

I think the thought process was that having a prefix would make it easier for a single service to serve from either S3 or the filesystem, choosing based on the prefix. I'll have to rethink this, because even in our own usage this isn't something that's going to happen.

jechols commented 5 years ago

@acdha What would your ideal setup be for identifiers? Just passing them straight to S3 (but with no prefix)? Having some kind of a translation map so the public representation can't be correlated with the actual S3 path? Custom configuration that lets you map prefixes to sources?
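To make the "map prefixes to sources" idea concrete, here's a minimal sketch in Go. The `sources` table, the backend strings, and `mapIDToSource` are all hypothetical illustrations, not RAIS's actual API: the point is just that the public identifier keeps a stable prefix while configuration decides which backend it resolves to.

```go
package main

import (
	"fmt"
	"strings"
)

// sources is a hypothetical prefix-to-backend table; in a real setup this
// would come from configuration rather than being hard-coded.
var sources = map[string]string{
	"collection1": "s3://bucket-one",
	"collection2": "file:///var/local/images",
}

// mapIDToSource splits a public identifier like "collection1:some/key" into
// its configured backend and the backend-specific key.
func mapIDToSource(id string) (backend, key string, ok bool) {
	parts := strings.SplitN(id, ":", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	backend, ok = sources[parts[0]]
	return backend, parts[1], ok
}

func main() {
	b, k, ok := mapIDToSource("collection1:batch1/page1.jp2")
	fmt.Println(b, k, ok) // s3://bucket-one batch1/page1.jp2 true
}
```

With a scheme like this, repointing "collection1" from a filesystem path to an S3 bucket would be a config change only, leaving public URLs untouched.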

Of course it's worth noting that if you're already planning on using s3fs, this is somewhat moot, as you won't need to use the S3 plugin.

acdha commented 5 years ago

My goal is to get rid of S3FS as quickly as possible — it's proven to be unstable under load and requires care to keep the cache from expanding until it fills the partition and fails — so I'm focused on servers which can handle S3 directly.

My immediate need is just the simplest approach: pass the identifier straight to S3 and 404 if the key doesn't exist, with the thought that multiple buckets could be handled by running multiple copies. However, I was just testing running without a rewriting proxy (my target being a deployment on AWS Fargate with an ALB, which supports routing paths but not rewriting them for the backend requests), and that strategy appears not to be possible since it seems to require that the path always be hostname/iiif. That might push towards your “Custom configuration that lets you map prefixes to sources” idea, so it could be something like collection1:… or collection2/… so the backend configuration could change without breaking the logical organization of the public URLs.

jechols commented 5 years ago

Interesting.

Real quick: the path is configurable, but static. So you could set it up to /foo, but all requests would still have to start with /foo, and everything thereafter is considered by RAIS to be part of the asset's ID. (There's also a hard-coded endpoint for a rudimentary admin API, by default bound to a different port, and another endpoint for the deepzoom protocol, but you shouldn't need to customize either of those). Our docker-compose demos require a set path so that the HTML that presents the viewers can be fairly static.
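The "static prefix, everything after is the ID" behavior described above can be sketched as follows. `assetID` is a hypothetical illustration of the rule, not RAIS's actual routing code:

```go
package main

import (
	"fmt"
	"strings"
)

// assetID treats everything after the configured static prefix as the
// asset's identifier, and rejects requests outside that prefix.
func assetID(path, prefix string) (id string, ok bool) {
	if !strings.HasPrefix(path, prefix+"/") {
		return "", false
	}
	return strings.TrimPrefix(path, prefix+"/"), true
}

func main() {
	id, ok := assetID("/iiif/batch1%2Fpage1.jp2", "/iiif")
	fmt.Println(id, ok) // batch1%2Fpage1.jp2 true

	_, ok = assetID("/other/batch1.jp2", "/iiif")
	fmt.Println(ok) // false
}
```

This is why an ALB that routes paths but can't rewrite them is a problem: the prefix the client sends has to be the prefix the server was configured with.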

Our use-case sounds very similar to yours, and I'm thinking we'll run into the same troubles you have. Given that, I think I'm going to file some tickets (hopefully tomorrow - it's getting late). If you have thoughts, I'd love to hear 'em, but these are obviously our problems, not yours.

RAIS still selects reader strategy based on IDs, but those who need or want it can keep the IDs opaque. And keeping the ability to have multiple types of reader based on ID means a single configuration can be used for any number of RAIS instances.

acdha commented 5 years ago

Real quick: the path is configurable, but static. So you could set it up to /foo, but all requests would still have to start with /foo, and everything thereafter is considered by RAIS to be part of the asset's ID.

I just figured out why this didn't work in my testing — I had a trailing slash on RAIS_IIIFURL:

docker run --rm -it --env-file=.env -e RAIS_S3ZONE=us-east-1 -e RAIS_S3BUCKET=ndnp-batches -e RAIS_LOGLEVEL=DEBUG -e RAIS_ADDRESS=":12415" -e RAIS_S3CACHE=/tmp/rais-s3 -e RAIS_IIIFURL=http://localhost/iiif/2/ -p 80:12415 uolibraries/rais
docker run --rm -it --env-file=.env -e RAIS_S3ZONE=us-east-1 -e RAIS_S3BUCKET=ndnp-batches -e RAIS_LOGLEVEL=DEBUG -e RAIS_ADDRESS=":12415" -e RAIS_S3CACHE=/tmp/rais-s3 -e RAIS_IIIFURL=http://localhost/iiif/2 -p 80:12415 uolibraries/rais
acdha commented 5 years ago

Issue #19 sounds like a must for an S3 setup if we expect a lot of traffic (the current S3 plugin stores files to disk which would give you the same problem as s3fs -- though we have a hidden configuration value for expiring the cache, it hasn't been tested with heavy traffic. hmm, maybe we could have a max file size setting... though streaming as-needed, with an optional cache may still be a far safer option)

I think that might be safer, especially if there's a cache for the image metadata so the reads could be sensibly bounded and hitting info.json won't always trigger access. Historically on ChronAm caching has been useful but less effective than anticipated because there's such a long tail of infrequently-accessed content, and that problem gets worse for a clustered deployment unless you have a robust sticky session implementation.
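A metadata cache that keeps reads "sensibly bounded" is usually an LRU with a fixed capacity. Here is a minimal sketch of that idea; the `lru` type and its methods are illustrative assumptions, not RAIS's actual cache implementation:

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a minimal bounded cache for info.json-style metadata, so repeated
// info requests don't each trigger a backend (S3) read. Capacity bounds RAM.
type lru struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // id -> list element holding the entry
}

type entry struct {
	id   string
	info string // stand-in for cached image metadata
}

func newLRU(capacity int) *lru {
	return &lru{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns cached metadata and marks the entry as recently used.
func (c *lru) Get(id string) (string, bool) {
	if el, ok := c.items[id]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).info, true
	}
	return "", false
}

// Put stores metadata, evicting the least-recently-used entry when full.
func (c *lru) Put(id, info string) {
	if el, ok := c.items[id]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).info = info
		return
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).id)
	}
	c.items[id] = c.order.PushFront(&entry{id, info})
}

func main() {
	c := newLRU(2)
	c.Put("a", "infoA")
	c.Put("b", "infoB")
	c.Get("a")          // touch "a" so "b" becomes the oldest entry
	c.Put("c", "infoC") // evicts "b"
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // true false
}
```

The long-tail problem mentioned above still applies: an LRU only helps for the head of the distribution, and per-instance caches in a cluster don't share hits.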

One of my to-test points has been using the ECS task storage in my deployment because that's shared but it's also only 4GB so until https://github.com/aws/containers-roadmap/issues/53 is resolved I'm not sure how much benefit it'd really offer.

acdha commented 5 years ago

RAIS still selects reader strategy based on IDs, but those who need or want it can keep the ids opaque.

This seems like the most important point to me: there's obvious value in having at least some basic prefix-mapping capability but I would like to keep the IIIF URLs stable if the backend is rearchitected.

jechols commented 5 years ago

Real quick: the path is configurable, but static. So you could set it up to /foo, but all requests would still have to start with /foo, and everything thereafter is considered by RAIS to be part of the asset's ID.

I just figured out why this didn't work in my testing — I had a trailing slash on RAIS_IIIFURL:

docker run --rm -it --env-file=.env -e RAIS_S3ZONE=us-east-1 -e RAIS_S3BUCKET=ndnp-batches -e RAIS_LOGLEVEL=DEBUG -e RAIS_ADDRESS=":12415" -e RAIS_S3CACHE=/tmp/rais-s3 -e RAIS_IIIFURL=http://localhost/iiif/2/ -p 80:12415 uolibraries/rais
docker run --rm -it --env-file=.env -e RAIS_S3ZONE=us-east-1 -e RAIS_S3BUCKET=ndnp-batches -e RAIS_LOGLEVEL=DEBUG -e RAIS_ADDRESS=":12415" -e RAIS_S3CACHE=/tmp/rais-s3 -e RAIS_IIIFURL=http://localhost/iiif/2 -p 80:12415 uolibraries/rais

That is incredibly poor design on our part. Another ticket is incoming. A trailing slash shouldn't matter.

jechols commented 5 years ago

I think that might be safer, especially if there's a cache for the image metadata so the reads could be sensibly bounded and hitting info.json won't always trigger access. Historically on ChronAm caching has been useful but less effective than anticipated because there's such a long tail of infrequently-accessed content, and that problem gets worse for a clustered deployment unless you have a robust sticky session implementation.

FYI, by default we cache the info.json for 10,000 images. Because that info is so tiny, it's like a 10 meg RAM hit, which is nothing. Of course, reading that data is also very cheap when operating on local disk, but for S3 that could be a big deal.
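As a back-of-the-envelope check on that figure (assuming roughly 1 KB of parsed metadata per entry, which is an assumption, not a measured value):

```go
package main

import "fmt"

func main() {
	// 10,000 cached info entries at an assumed ~1 KB each: on the order
	// of 10 MB of RAM, matching the "10 meg hit" estimate above.
	const entries = 10000
	const bytesPerEntry = 1024 // assumed average; real entries vary in size
	fmt.Printf("%.1f MB\n", float64(entries*bytesPerEntry)/(1<<20)) // 9.8 MB
}
```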

Tile caching is also an option, but it's obviously not going to have a great cache hit rate for the amount of RAM needed.

jechols commented 4 years ago

This is fixed in develop. Since the wiki isn't versioned, though, there are no docs to point you to at the moment.

I'm hoping for a 4.0.0 release fairly soon. Until then, the basic use can be seen in the rais-example.toml file if you pull down develop: https://github.com/uoregon-libraries/rais-image-server/blob/develop/rais-example.toml#L49