openimagerynetwork / oin-register

Register of participant data providers in the Open Imagery Network.
Creative Commons Attribution 4.0 International

S3 bucket requirements #2

smit1678 opened this issue 9 years ago

smit1678 commented 9 years ago

While S3 may not be the only object store used within OIN, we're starting with S3 for the initial object stores. First pass at a requirements list for setting up a bucket so it can be on the register and get indexed:

- S3 bucket
- Have `ListBucket` and `GetObject` rights for the public (see the policy sketch below this list)
- Includes GeoTIFF images
- Includes meta `.json` files
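
For reference, a minimal sketch of a bucket policy granting those public rights (the bucket name example-oin-bucket is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicListBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-oin-bucket"
    },
    {
      "Sid": "PublicGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-oin-bucket/*"
    }
  ]
}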

@lossyrob @warmerdam - add, subtract, change?

cc @scisco

lossyrob commented 9 years ago

@smit1678 that sounds right, plus once we figure out where the contribution metadata lives (per #1), either placing that metadata in the register or hosting that metadata at a publicly accessible URI.

Also worth specifying that they are RGB GeoTIFFs. We might want to loosen that requirement later to handle things like RGBA or file-per-band imagery, but for now I think we can start with "includes RGB GeoTIFF images".

The metadata should be named the same as the GeoTIFF it belongs to (and live in the same location), with the only difference being that the metadata file ends with .json and the GeoTIFF file ends with .tif.
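
For example (hypothetical paths, just to illustrate the naming convention above):

http://example-oin-bucket.s3.amazonaws.com/2015/scene_001.tif
http://example-oin-bucket.s3.amazonaws.com/2015/scene_001.json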

smit1678 commented 9 years ago

:+1: Sounds good. Next action here is to drop a technical description into the README or a requirements doc.

jywarren commented 9 years ago

Hi, all - catching up with @ebarry after SotM just now - we've committed to supporting OIN's metadata format (https://github.com/publiclab/mapknitter/issues/178), but I was curious: why is an object storage method a requirement? Doesn't the .json file itself provide the URI needed, which would mean that any public-facing HTTP/S interface should be acceptable?

I'm also curious how you find the meta .json files; are you planning on requiring a master JSON listing of available images, provided in the registry? Thanks!

lossyrob commented 9 years ago

Hi @jywarren, thanks for getting in touch! The object store requirement was discussed as a way to simplify things, and it seemed like the most likely starting candidate for a storage mechanism in OIN. I don't think the plan was to keep it as a hard requirement, though. What storage mechanism would you like to see supported?

Currently, the OIN register holds metadata for each contribution, which in the sample cases are S3 buckets. A reader for OIN would then need to know how to list the contents of a bucket. Each image in the bucket would have metadata associated with it; the spec for that metadata is being worked out here: https://github.com/openimagerynetwork/oin-metadata-spec. We don't yet have a spec for the JSON that is contributed to the OIN register itself, and that should certainly be created. I've opened an issue for just that: https://github.com/openimagerynetwork/oin-metadata-spec/issues/9

jywarren commented 9 years ago

Hi, Rob - thank you! We just store things as static files on a web host -- plain HTTP. S3 would be the same in its public HTTP interface, so the difference seems to be mainly that S3 provides a more standard directory listing.

Nginx v1.7.9 and higher offers a standard autoindex_format json - http://nginx.org/en/docs/http/ngx_http_autoindex_module.html#autoindex_format
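
As a minimal sketch (assuming nginx 1.7.9+; the location path is just an example), the listing could be exposed like this:

location /imagery/ {
    autoindex on;
    autoindex_format json;  # JSON directory listing, available since nginx 1.7.9
}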

So if it were a manually written index, or the output of a web app, perhaps it should match that format. S3 only provides an XML index listing (http://stackoverflow.com/questions/9153552/amazon-s3-response-in-json), which is too bad, but I guess that's not a problem for a server-side script to parse.

For our part, we'd like to provide a REST (json is fine) index of the assets we're sharing -- this index would just list all the individual metadata files. This could be written statically but we'd probably generate it in MapKnitter, and possibly in the PublicLab.org archive as well.

How does that sound? It'd be really really simple, something like:

["http://url/to/meta.json",
 "http://url/to/meta.json",
 "http://url/to/meta.json"]

We could also adopt an even simpler format -- YML, say, or just a list of newline-delimited URLs, somewhat like an HTML5 manifest file (example: http://spectralworkbench.org/index.manifest):

http://url/to/meta.json
http://url/to/meta.json
http://url/to/meta.json

A hybrid approach where the index.json and meta.json files are combined might get weird because plenty of people may want to simply serve static files, so compiling them into a master json file seems like work they'd have to do each time they added a file. So the nesting doesn't sound too inefficient.

lossyrob commented 9 years ago

Hey @jywarren,

It's not currently a manually listed index that the register entry points to, although that was proposed. I think there's room for both. If we had the type as "s3", we could rely on the client to use s3's API to list the entries. Perhaps for a "uri" type, the uri would point to a JSON file that lists the individual contributed imagery. This way the "uri" type could be quite generic, and would not require the client to have to know how to list that contribution's imagery set, whether it be through an s3 listing or nginx directory listing.

@smit1678 what do you think? It would be great if we could fit a Public Labs contribution into OIN with this contribution type, and have it indexed by OAM.

To be specific, the entry in master.json would look something like:

        {
            "name": "Public Labs",
            "contact": "info@publiclabs.org",
            "locations": [
                {
                    "type": "uri",
                    "format": "application/json",
                    "listing": "http://url/to/oin-listing.json"
                }                
            ]
        }

and then the http://url/to/oin-listing.json would look something like:

{
  "images" : [
    "http://url/to/meta.json",
    "http://url/to/meta.json",
    "http://url/to/meta.json"
  ]
}

or perhaps, this combination:

        {
            "name": "Public Labs",
            "contact": "info@publiclabs.org",
            "locations": [
                {
                    "type": "uri",
                    "format": "application/text",
                    "listing": "http://url/to/oin-listing.txt"
                }                
            ]
        }

oin-listing.txt:

http://url/to/meta.json
http://url/to/meta.json
http://url/to/meta.json

Thoughts?

cholmes commented 9 years ago

Thanks for chiming in @jywarren - would be awesome to make Public Labs data part of OIN.

So for me there were two major reasons to require an object store:

  1. Support for GET Range queries: https://greenbytes.de/tech/webdav/draft-ietf-httpbis-p5-range-latest.html#range.requests You can use GDAL's VSICurl to access just the portions of the file you want, instead of having to download huge files. To me this seems like it should be required, to keep the infrastructure simple but powerful. If your HTTP server can support range requests then that could work, but I'd want to leave it in as a requirement (see the /vsicurl/ sketch after this list).
  2. General reliability. To me, part of the problem with the OGC web infrastructure vision is that it depends on a number of fairly complex servers staying up with full reliability. Making it just HTTP perhaps simplifies that, but I still worry about people and groups who don't have tons of experience keeping servers running with high reliability becoming nodes, and then having OIN as a whole feel unreliable. Object stores tend to have really high reliability, backed by teams with lots of experience. And if, say, a university wanted to donate space to OIN, they could likely run an OpenStack object store.
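
As a rough illustration of point 1 (the URL is a placeholder), GDAL can read just the header of a remote GeoTIFF over HTTP range requests via its /vsicurl/ virtual file system:

# reads only the bytes needed for metadata, not the whole (possibly multi-GB) file
gdalinfo /vsicurl/http://example.com/imagery/scene_001.tif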

I could possibly back down on the second point, but I also want the initial spec to be as simple as possible, and opening up S3 vs. URI listings to start seems like additional complexity before we've even started.

So in my vision of how Public Labs fits into OIN, it would be a 'contributing node', updating an S3 bucket (eventually one that someone else pays for). I do like the notion of more people using the OIN metadata format, perhaps to signal that new data should be ingested. But it feels like we'd have a more reliable overall infrastructure if the base was object stores, instead of a panoply of servers with different SLAs attached to them.

I don't feel super strongly about this, though; I'm still working it out in my mind and am open to another vision. But to me, relying on object stores was a key part of making OIN powerful but simple, while being very reliable.

jywarren commented 9 years ago

Hi Chris! That makes sense; I figured there must be a reason not to stick to vanilla HTTP. Luckily, nginx v1.2.1 (rackspace/debian default atm, I believe) supports range queries:

$ curl  --header "Range: bytes=0-800000" http://archive.publiclab.org/2010/2010-05-07-us-louisiana-portfourchon/geotiff/2010-05-07-louisiana-portfourchon.tif -o partial.tif
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   192  100   192    0     0   1599      0 --:--:-- --:--:-- --:--:--  2526

Rob - that format looks great, and easy, but I'm wondering: what are you reading from S3 with? Are you using a library? I'm just curious -- if it makes things easier for you all, we could simply adopt the index response of Amazon S3 as our format, although as I mentioned, they don't do JSON AFAIK.

We've backed up our archive to s3, but it takes some finagling to get it to act enough like an HTTP service that Leaflet will load from it, and it's just much more straightforward to get tiles serving and so forth from a standard platform like nginx or apache. And there's something very attractive about not having to go through s3 config dashboards to simply serve files on the web.

Thanks a bunch for your quick responses! This is exciting.

jflasher commented 9 years ago

@jywarren We're using https://github.com/andrewrk/node-s3-client and chiefly the listObjects method to get the contents of the bucket. Part of the function of the library is to convert the response into JSON with an output like below

{ IsTruncated: true,
  Marker: '',
  Contents: 
   [ { Key: '2015-04-13_borahatward_merged_transparent_mosaic_group1.tif',
       LastModified: Wed Jun 03 2015 14:46:52 GMT-0400 (EDT),
       ETag: '"1f42789c7497319b8a63bc2353df5db2"',
       Size: 3671946241,
       StorageClass: 'STANDARD' },
     { Key: '2015-04-13_borahatward_merged_transparent_mosaic_group1.tif_meta.json',
       LastModified: Thu Jun 04 2015 12:10:15 GMT-0400 (EDT),
       ETag: '"236dff37be970baf5bdb9b48e797d384"',
       Size: 1545,
       StorageClass: 'STANDARD' },
     { Key: '2015-04-13_borahatward_merged_transparent_mosaic_group1_thumb.jpg',
       LastModified: Wed Jun 03 2015 15:55:40 GMT-0400 (EDT),
       ETag: '"f59e4909eb18842c6fc201bc9de6c56f"',
       Size: 93826,
       StorageClass: 'STANDARD' },
     { Key: '2015-04-14_hospital_merged_transparent_mosaic_group1.tif',
       LastModified: Wed Jun 03 2015 14:12:38 GMT-0400 (EDT),
       ETag: '"99bb6bcee2b63fd2bd9c05a9e487f542"',
       Size: 2648323243,
       StorageClass: 'STANDARD' },
...
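
For reference, a rough sketch of how that listObjects call is wired up (bucket name and credentials are placeholders; see the library's README for the exact options):

var s3 = require('s3');

// credentials can also come from the environment or an IAM role; shown inline for clarity
var client = s3.createClient({
  s3Options: {
    accessKeyId: 'YOUR_KEY',
    secretAccessKey: 'YOUR_SECRET'
  }
});

// listObjects returns an EventEmitter; each 'data' event carries a page of results
var lister = client.listObjects({
  s3Params: { Bucket: 'example-oin-bucket', Prefix: '' }
});

lister.on('data', function (data) {
  // data.Contents is the array of { Key, LastModified, ETag, Size, ... } entries shown above
  console.log(data.Contents.length, 'objects in this page');
});

lister.on('error', function (err) {
  console.error('listing failed:', err);
});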

lossyrob commented 9 years ago

@jywarren with the S3 type, clients could use the AWS SDKs that are open sourced and available in most languages...the results of the API calls that list a bucket are simply returned in that language's native types.

Just to make sure it's clear (based on the Leaflet comment)...OIN is not going to be hosting tiled imagery, which is what you would need for Leaflet to display it (i.e. imagery made available through a tiling service, commonly referred to as TMS). It will be hosting RGB GeoTIFFs in any projection and of any size. We are creating a tiling service for OAM that will take sets of imagery off of OIN and create tile services from it, but that is OpenAerialMap specific and not a requirement to participate in OIN. So that's an important thing to note.

If you're backing up your archive to S3, that might be the type of imagery to share...what format is the archive stored in?

RE @cholmes points:

Support for GET Range queries

We should decide if we're going to exclude storage types that do not support GET range queries from OIN. My thought is no; instead we should record that as metadata per type or per contribution, e.g. "range_query_support" : "yes". A combination of that information and the per-image file size metadata allows clients either to read windows where supported, or to choose not to download large images where not. That way the images are still part of the network, and any reader that only wants imagery exposed through range queries can simply filter out what it doesn't want to read. There is a potential downside: requiring object stores or servers that support GET range queries would encourage contributors to put in the effort to host imagery somewhere with range query support; without that incentive, contributors might skip the extra effort if the easier option is to share the imagery in a less ideal way (say, FTP).
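
As a sketch, the proposed flag might slot into the locations entry format from earlier in this thread like so (the flag name and value are just the suggestion above, not a settled spec):

{
    "type": "uri",
    "format": "application/json",
    "listing": "http://url/to/oin-listing.json",
    "range_query_support": "yes"
}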

General reliability

I think this is a really good point, about having to ensure the network is reliable in order to create and keep trust in it. Someone could run an OpenStack object store very poorly, and someone else could run an nginx server very well. Perhaps the threshold here is not the storage mechanism but some other measure: if we had a service that continually checked in on imagery providers, or did some sort of spot checking, and flagged bad nodes for removal, we could be inclusive while also keeping out bad apples. This would just check that servers were up, and not necessarily check whether a server could handle stress. Maybe some sort of onboarding stress test for new OIN nodes of certain types is in order? There are certainly open questions, not least of which is who would host, operate and maintain this node-checker software. But I believe these are all things that could be worked out, so that the network can be inclusive and also reliable.

jywarren commented 9 years ago

Yes, our particular setup is to store a tileset, geotiff, and jpg alongside one another. It's not relevant to this application though, thanks!

mojodna commented 9 years ago

+1 on a prefix property within a locations entry (via https://github.com/openimagerynetwork/oin-register/issues/1#issuecomment-103229376)

s3-type locations should also include an optional region property that can be used to reduce latency (http://docs.aws.amazon.com/general/latest/gr/rande.html), and an optional endpoint property, which would facilitate S3-compatible object stores (Google Cloud Storage, OpenStack Swift, etc.).
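
A hypothetical s3-type entry with those additions might look like the following (property names other than region and endpoint follow examples elsewhere in this thread; for an S3-compatible store the endpoint would point at that provider's API instead):

{
    "type": "s3",
    "bucket_name": "example-oin-bucket",
    "prefix": "2015/",
    "region": "us-west-2",
    "endpoint": "https://s3.us-west-2.amazonaws.com"
}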

On the listability front, listing requires a specific bucket policy to be applied (we should document this). The assumption would be that the s3 type can/should be listed unless a pointer to a key/URI containing a list of files/URIs is provided.

(My preference would be for plaintext listings over bespoke JSON, in part because servers like Nginx, etc. can likely be configured to output them.)

mojodna commented 9 years ago

Aesthetically, is it too late to propose that bucket_name be shortened to bucket?