radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

Best Practice: requestor pays #896

Closed cholmes closed 3 years ago

cholmes commented 3 years ago

As suggested on gitter by @matthewhanson - it'd be good to have a best practice on URLs that are 'requester pays'. We should capture these thoughts and put them in best practices.

> For requester pays URLs I've been using the s3 URL, e.g., s3://syncarto-data-rp/stac/naip/catalog.json. The http URL is useless on its own unless you sign it, so just working with the s3 URLs directly (with the AWS CLI or boto3) is, I think, easier. Plus you can use PySTAC to support s3 reads/writes. If public, then I use the actual http URL. This might be a good thing to add to best practices.

> Even better might be to keep the STAC metadata in a different, and completely public, bucket that isn't requester pays. Normally I like the data alongside the STAC Items, but I think it's better if it's public. That way you can use tools like STAC Browser and PySTAC without authentication for just the metadata.
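To make the first quote concrete, here is a minimal sketch of reading a requester-pays catalog over the s3 URL with boto3 and handing it to PySTAC (the bucket and key come from the example above; credentials are assumed to be configured, and error handling is omitted):

```python
import json

import boto3
import pystac

s3 = boto3.client("s3")

# RequestPayer="requester" acknowledges that the caller's AWS account
# is billed for the request and the egress.
response = s3.get_object(
    Bucket="syncarto-data-rp",
    Key="stac/naip/catalog.json",
    RequestPayer="requester",
)
catalog = pystac.Catalog.from_dict(json.loads(response["Body"].read()))
print(catalog.id)
```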

m-mohr commented 3 years ago

Is this just for S3? Would a person buying data at Planet also be "requester pays"? Or how exactly is that defined outside of S3?

davidraleigh commented 3 years ago

This is a field I have on the grpc STAC version of assets: https://geo-grpc.github.io/api/#epl.protobuf.v1.Asset

It's also used in Google Cloud: https://cloud.google.com/storage/docs/requester-pays

And I imagine it also exists in Azure.

jflasher commented 3 years ago

I think it'd definitely be good to have requester pays called out in the metadata, as it presents a technical and financial difference in how you access the data. I have tried to create the request signatures myself for use with straight HTTP requests, but always fall back on the available SDKs. Also, at least for AWS, there are two costs incurred with requester pays: egress and a per-request fee. The per-request fee is generally very small compared to the egress cost, but not always (specifically when listing bucket contents), and it likely should be mentioned for completeness.

philvarner commented 3 years ago

S3 and Google have requester pays; Azure apparently does not.

Overall, I think these concepts are cross-provider (e.g., not only S3) and useful enough to warrant an extension.

I like some of the fields in @davidraleigh's link -- a few comments on them:

matthewhanson commented 3 years ago

A couple years ago we talked about "storage profiles" for STAC to describe some of these things, but nothing ever came of it.

I think a "cloud_storage" extension is warranted (or maybe just "cloud"). It can be set in Item properties, but could also be set per asset using the general Asset-specific metadata rule:

Fields:

I'd avoid putting in bucket and object path; converting between s3 and http URLs is easy enough, and it would be good to avoid duplication.
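For illustration, a hypothetical asset using such an extension might look like the sketch below (the `cloud_storage:` prefix and the field names are illustrative only, not a settled proposal):

```json
{
  "assets": {
    "image": {
      "href": "https://example-bucket.s3.us-west-2.amazonaws.com/naip/m_12345.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "cloud_storage:platform": "aws",
      "cloud_storage:region": "us-west-2",
      "cloud_storage:requester_pays": true
    }
  }
}
```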

davidraleigh commented 3 years ago

We use STAC a lot internally, so object_path and bucket are useful to those internal users who have access permissions to use them, but for customers there is an href that isn't constructed from bucket + object_path.

matthewhanson commented 3 years ago

@davidraleigh Ah, so this is really a case where you might have multiple URLs to the same assets. We've run into this where we use s3 URLs, but for external users we have CloudFront URLs. We've been handling that just by translating the URLs in a service built on top of the normal STAC API.

I could see an "alternate_hrefs" array in assets for something like this, if we wanted it to be more general. This would also be able to represent actual data mirrors.
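For illustration, an asset carrying that "alternate_hrefs" idea might look like this (the shape and URLs are hypothetical):

```json
{
  "assets": {
    "data": {
      "href": "https://dexample123.cloudfront.net/scenes/abc/data.tif",
      "alternate_hrefs": [
        "s3://example-internal-bucket/scenes/abc/data.tif"
      ]
    }
  }
}
```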

davidraleigh commented 3 years ago

I'm stumped as to which is the clearest method. I love object_path and bucket, because I think of everything as having a bucket. But I could see something like alternate_hrefs not being too attached to the whole bucket cloud-storage paradigm.

cholmes commented 3 years ago

Two things here:

We want to provide real recommendations for next release.

cholmes commented 3 years ago

@matthewhanson - I can take on the work of writing this up, but I need a clearer idea of what exactly to say. Others, please weigh in as well - I'm happy to try to write this up, but I don't have deep experience with STAC & cloud locations.

I noted a bit from our call. My questions:

jflasher commented 3 years ago

In addition to the fields mentioned above, I think having something like storage_class would also be useful. I think we'll see datasets in the future that have a mix of warm and cold storage. You'd still want the metadata for the data in cold storage, but it'd be beneficial to know that the data will not be immediately available.

Talking myself out of the above: data generally gets brought out of cold storage for some period of time and then returned, so its storage_class is not constant. If the STAC entry isn't updated when the data is brought out of cold storage, the field likely becomes less useful. The likely pattern without this field (or if it's not updated) is that you'd 1) request the object, 2) get a message that says it's not available, and then 3) follow some other step to bring it out of cold storage. An up-to-date field here likely just lets you skip step 1.
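As a sketch of that three-step flow on S3 (bucket and key hypothetical): boto3's head_object exposes the storage class and, once a restore is in flight or finished, a Restore header, and restore_object kicks off the retrieval:

```python
import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="example-bucket", Key="cold/scene.tif")

# StorageClass is absent for S3 Standard objects; Restore appears once
# a restore has been requested or has completed.
in_cold_storage = head.get("StorageClass") in ("GLACIER", "DEEP_ARCHIVE")
if in_cold_storage and "Restore" not in head:
    # Stage a temporary copy of the object for retrieval.
    s3.restore_object(
        Bucket="example-bucket",
        Key="cold/scene.tif",
        RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
    )
```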

Also, I think it would definitely be good to include region. I presume we'd want to use the platform-specific region designations? That'll be less meaningful to someone using a different platform, but a) it's likely not of interest to them anyway, and b) it doesn't seem like STAC's role to somehow unify those designations.

cholmes commented 3 years ago

Storage class does seem like a good option to have. Is there a generic / cross-cloud way to refer to the classes? I'm not deep on the options and how they map across clouds. Perhaps we'd have a little table that maps a generic name to the names on each of the major services.
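Sketching what such a table might look like (the generic names are a strawman, and the provider names are as I understand their current tiers; treat this as illustrative, not exhaustive):

| Generic name | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| standard | S3 Standard | Standard | Hot |
| infrequent access | S3 Standard-IA | Nearline | Cool |
| cold | S3 Glacier | Coldline | — |
| deep archive | S3 Glacier Deep Archive | Archive | Archive |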

Region I agree we'd need platform specific designations.

If anyone has time to write up a PR on the extension, even a draft one, that'd be much appreciated, as I've got a backlog of 1.0-RC1 stuff. I guess as an extension this doesn't need to be done by RC1, but it'd be nice to have.

jflasher commented 3 years ago

Thinking about this a little more, maybe it's not that important to track storage class itself? Maybe something like immediately_available:T/F or retrieval_needed:T/F instead. While storage_class seems useful, I feel like it may put some effort on the user to figure out what a given storage class means.

davidraleigh commented 3 years ago

@cholmes what's the timeline for writing up a PR? I'm a little bogged down for the next week and a half, but I could put more thought into it after that.

I would like a bitmask enum that I can use at the Asset and StacItem level that carries provider storage-level information. I could search for all data that's currently on nearline and prepare to move it to coldline (using GCP terms for a minute). We have STAC items in multiple cloud providers, so a bitmask would allow me to look at what's nearline in AWS and coldline in GCP. And then on the asset level itself I could use the enum to define the status of the item.
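A minimal sketch of that bitmask idea in Python (the names and flag layout are illustrative, not a proposal):

```python
from enum import IntFlag, auto

class StorageFlag(IntFlag):
    # provider bits
    AWS = auto()
    GCP = auto()
    AZURE = auto()
    # tier bits (GCP-style names, per the comment above)
    STANDARD = auto()
    NEARLINE = auto()
    COLDLINE = auto()
    ARCHIVE = auto()

def matches(asset_flags: StorageFlag, query: StorageFlag) -> bool:
    # True when the asset carries every bit set in the query mask.
    return (asset_flags & query) == query

nearline_on_gcp = StorageFlag.GCP | StorageFlag.NEARLINE
print(matches(nearline_on_gcp, StorageFlag.GCP | StorageFlag.NEARLINE))  # True
print(matches(nearline_on_gcp, StorageFlag.AWS | StorageFlag.COLDLINE))  # False
```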

m-mohr commented 3 years ago

FYI: In STAC Index I have three classes of availability: public (accessible without any authentication), protected (authentication required for data access, but metadata accessible to all), and private (authentication required for everything and/or only accessible to some groups, e.g. you must sign a contract first, live in a specific country (geo-fenced), or work for a federal government).

> I presume we'd want to use the platform-specific region designations?

Yes, I think so, too. You usually can't search for that anyway because it's hidden in assets.

cholmes commented 3 years ago

> So maybe something like immediately_available:T/F or retrieval_needed

I like the direction of going more generic. But once we get here I start thinking about the general use case of 'ordering' data - providers (like Planet) generate the GeoTIFFs on demand. It'd be great to cover that too, and retrieval_needed vs immediately_available seem like they'd work. It probably wouldn't be a 'cloud storage' extension, but perhaps a pair of extensions - one on 'asset availability' or something like that, and one on cloud storage - with regions and requester pays. The asset availability one would hopefully cover @m-mohr's public/protected/private as well.

> @cholmes what's the timeline for writing up a PR?

@davidraleigh - I doubt I'll get to it in the next two weeks, so if you could do it within that time frame that'd be great.

m-mohr commented 3 years ago

By the way, there are related issues for accessing and ordering data: #836 and #891

davidraleigh commented 3 years ago

@cholmes I can make an attempt at a pull request this week.

cholmes commented 3 years ago

@davidraleigh - awesome! Be warned, we are going to move most of the extensions out of the core repo soon; see #946. But feel free to make a PR here; it'll just probably be applied to another repo.

cholmes commented 3 years ago

Circling back on this - we've got a lot of great energy on the cloud storage extension. But I don't think we need it for 1.0.0, as STAC works fine without it, and it'll be a nice addition to have as an extension (I'm not set on that, but I'd want to hear a good argument).

But what do we want to actually say in the spec itself? Should we call out the use of s3://-style URLs in a best practice? And say that those are recommended when data is requester pays? And also recommend that people don't put their STAC metadata in requester-pays buckets?

cholmes commented 3 years ago

For main best practices:

cholmes commented 3 years ago

Closing this, though note that if we do get a storage extension soon, we should link to it from this best practice.