radiantearth / stac-spec

SpatioTemporal Asset Catalog specification - making geospatial assets openly searchable and crawlable
https://stacspec.org
Apache License 2.0

Best Practice: requestor pays #896

Closed cholmes closed 3 years ago

cholmes commented 3 years ago

As suggested on gitter by @matthewhanson - it'd be good to have a best practice on URLs that are 'requester pays'. We should capture these thoughts and put them in best practices.

> For requester pays URLs I've been using the s3 URL, e.g., s3://syncarto-data-rp/stac/naip/catalog.json. The http URL is useless on its own unless you sign it, so just working with the s3 URLs directly (with the AWS CLI or boto3) is, I think, easier. Plus you can use PySTAC to support s3 reads/writes. If public, then I use the actual http URL. This might be a good thing to add to best practices.

> Even better might be to keep the STAC metadata in a different, and completely public, bucket that isn't requester pays. Normally I like the data alongside the STAC Items, but I think it's better if it's public. That way you can use tools like STAC Browser and PySTAC without authentication for just the metadata.
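To make the first quote concrete, here is a minimal sketch of reading a requester-pays catalog over the s3 URL with boto3 and handing it to PySTAC (the bucket and key come from the example above; credentials are assumed to be configured, and error handling is omitted):

```python
import json

import boto3
import pystac

s3 = boto3.client("s3")

# RequestPayer="requester" acknowledges that the caller's AWS account
# is billed for the request and the egress.
response = s3.get_object(
    Bucket="syncarto-data-rp",
    Key="stac/naip/catalog.json",
    RequestPayer="requester",
)
catalog = pystac.Catalog.from_dict(json.loads(response["Body"].read()))
print(catalog.id)
```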

m-mohr commented 3 years ago

Is this just for S3? Would a person buying data at Planet also be "requester pays"? Or how exactly is that defined outside of S3?

davidraleigh commented 3 years ago

This is a field I have on the grpc STAC version of assets: https://geo-grpc.github.io/api/#epl.protobuf.v1.Asset

It's also used in Google Cloud: https://cloud.google.com/storage/docs/requester-pays

And I imagine it also exists in Azure.

jflasher commented 3 years ago

I think it'd definitely be good to have requester pays called out in the metadata, as it presents a technical and financial difference in how you access the data. I have tried to create the request signatures myself for use with straight HTTP requests, but always fall back on the available SDKs. Also, at least for AWS, there are two costs incurred with requester pays: egress and a per-request fee. The per-request fee is generally very small compared to the egress cost, but not always (specifically when listing bucket contents), and it likely should be mentioned for completeness.

philvarner commented 3 years ago

S3 and Google have requester pays; Azure apparently does not.

Overall, I think these concepts are cross-provider (e.g., not only S3) and useful enough to warrant an extension.

I like some of the fields in @davidraleigh's link -- a few comments on them:

matthewhanson commented 3 years ago

A couple years ago we talked about "storage profiles" for STAC to describe some of these things, but nothing ever came of it.

I think a "cloud_storage" extension is warranted (or maybe just "cloud"). It can be set in Item properties, but could also be set per asset using the general Asset-specific metadata rule:

Fields:

I'd avoid putting in bucket and object path; converting between s3 and http URLs is easy enough, and it would be good to avoid duplication.
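For illustration, a hypothetical asset using such an extension might look like the sketch below (the `cloud_storage:` prefix and the field names are illustrative only, not a settled proposal):

```json
{
  "assets": {
    "image": {
      "href": "https://example-bucket.s3.us-west-2.amazonaws.com/naip/m_12345.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "cloud_storage:platform": "aws",
      "cloud_storage:region": "us-west-2",
      "cloud_storage:requester_pays": true
    }
  }
}
```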

davidraleigh commented 3 years ago

We use STAC a lot internally, so object_path and bucket are useful to those internal users who have access permissions to use them, but for customers there is an href that isn't constructed from bucket + object_path.

matthewhanson commented 3 years ago

@davidraleigh Ah, so this is really a case where you might have multiple URLs to the same assets. We've run into this where we use s3 URLs, but for external users we have CloudFront URLs. We've been handling that just by translating the URLs in a service built on top of the normal STAC API.

I could see an "alternate_hrefs" array in assets for something like this, if we wanted it to be more general. This would also be able to represent actual data mirrors.
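For illustration, an asset carrying that "alternate_hrefs" idea might look like this (the shape and URLs are hypothetical):

```json
{
  "assets": {
    "data": {
      "href": "https://dexample123.cloudfront.net/scenes/abc/data.tif",
      "alternate_hrefs": [
        "s3://example-internal-bucket/scenes/abc/data.tif"
      ]
    }
  }
}
```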

davidraleigh commented 3 years ago

I'm stumped as to which is the clearest method. I love object_path and bucket, because I think of everything as having a bucket. But I could see something like alternate_hrefs not being too attached to the whole bucket cloud-storage paradigm.

cholmes commented 3 years ago

Two things here:

We want to provide real recommendations for next release.

cholmes commented 3 years ago

@matthewhanson - I can take on the work of writing this up, but I need a clearer idea of what exactly to say. Others, please weigh in as well - I'm happy to try to write this up, but I don't have deep experience with STAC & cloud locations.

I noted a bit from our call. My questions:

jflasher commented 3 years ago

In addition to the fields mentioned above, I think having something like storage_class would also be useful. I think we'll see datasets in the future that have a mix of warm and cold storage. You'd still want the metadata for the data in cold storage, but it'd be beneficial to know that the data will not be immediately available.

Talking myself out of the above: data generally gets brought out of cold storage for some period of time and then returned, so its storage_class is not constant. If the STAC entry isn't updated when the data is brought out of cold storage, the field likely becomes less useful. The likely pattern without this field (or if it's not updated) is that you'd 1) request the object, 2) get a message that says it's not available, and then 3) follow some other step to bring it out of cold storage. An up-to-date field here likely just lets you skip step 1.
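As a sketch of that three-step flow on S3 (bucket and key hypothetical): boto3's head_object exposes the storage class and, once a restore is in flight or finished, a Restore header, and restore_object kicks off the retrieval:

```python
import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="example-bucket", Key="cold/scene.tif")

# StorageClass is absent for S3 Standard objects; Restore appears once
# a restore has been requested or has completed.
in_cold_storage = head.get("StorageClass") in ("GLACIER", "DEEP_ARCHIVE")
if in_cold_storage and "Restore" not in head:
    # Stage a temporary copy of the object for retrieval.
    s3.restore_object(
        Bucket="example-bucket",
        Key="cold/scene.tif",
        RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
    )
```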

Also, I think it would definitely be good to include region. I presume we'd want to use the platform-specific region designations? That'll be less meaningful to someone using a different platform, but a) it's likely not of interest to them anyway, and b) it doesn't seem like STAC's role to somehow unify those designations.

cholmes commented 3 years ago

Storage class does seem like a good option to have. Is there a generic / cross-cloud way to refer to the classes? I'm not deep on the options and how they map across clouds. Perhaps we'd have a little table that maps a generic name to the names on each of the major services.
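Sketching what such a table might look like (the generic names are a strawman, and the provider names are as I understand their current tiers; treat this as illustrative, not exhaustive):

| Generic name | AWS S3 | Google Cloud Storage | Azure Blob Storage |
|---|---|---|---|
| standard | S3 Standard | Standard | Hot |
| infrequent access | S3 Standard-IA | Nearline | Cool |
| cold | S3 Glacier | Coldline | — |
| deep archive | S3 Glacier Deep Archive | Archive | Archive |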

Region I agree we'd need platform specific designations.

If anyone has time to write up a PR on the extension, even a draft one, that'd be much appreciated, as I've got a backlog of 1.0-RC1 stuff. I guess as an extension this doesn't need to be done by RC1, but it'd be nice to have.

jflasher commented 3 years ago

Thinking about this a little more, maybe it's not that important to track storage class itself? Maybe something like immediately_available:T/F or retrieval_needed:T/F instead. While storage_class seems useful, I feel like it may put some effort on the user to figure out what a given storage class means.

davidraleigh commented 3 years ago

@cholmes what's the timeline for writing up a PR? I'm a little bogged down for the next week and a half, but I could put more thought into it after that.

I would like a bitmask enum that I can use at the Asset and StacItem level that carries provider storage-level information. I could search for all data that's currently on nearline and prepare to move it to coldline (using GCP terms for a minute). We have STAC items in multiple cloud providers, so a bitmask would allow me to look at what's nearline in AWS and coldline in GCP. And then on the asset level itself I could use the enum to define the status of the item.
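A minimal sketch of that bitmask idea in Python (the names and flag layout are illustrative, not a proposal):

```python
from enum import IntFlag, auto

class StorageFlag(IntFlag):
    # provider bits
    AWS = auto()
    GCP = auto()
    AZURE = auto()
    # tier bits (GCP-style names, per the comment above)
    STANDARD = auto()
    NEARLINE = auto()
    COLDLINE = auto()
    ARCHIVE = auto()

def matches(asset_flags: StorageFlag, query: StorageFlag) -> bool:
    # True when the asset carries every bit set in the query mask.
    return (asset_flags & query) == query

nearline_on_gcp = StorageFlag.GCP | StorageFlag.NEARLINE
print(matches(nearline_on_gcp, StorageFlag.GCP | StorageFlag.NEARLINE))  # True
print(matches(nearline_on_gcp, StorageFlag.AWS | StorageFlag.COLDLINE))  # False
```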

m-mohr commented 3 years ago

FYI: In STAC Index I have three classes of availability: public (accessible without any authentication), protected (authentication required for data access, but metadata accessible to all), and private (authentication required for everything and/or only accessible to some groups, e.g. you must sign a contract first, live in a specific country (geo-fenced), or work for a federal government).

> I presume we'd want to use the platform-specific region designations?

Yes, I think so, too. You usually can't search for that anyway because it's hidden in assets.

cholmes commented 3 years ago

> So maybe something like immediately_available:T/F or retrieval_needed

I like the direction of going more generic. But once we get here I start thinking about the general use case of 'ordering' data - providers (like Planet) generate the GeoTIFFs on demand. It'd be great to cover that too, and retrieval_needed vs immediately_available seem like they'd work. It probably wouldn't be a 'cloud storage' extension, but perhaps a pair of extensions - one on 'asset availability' or something like that, and one on cloud storage - with regions and requester pays. The asset availability one would hopefully cover @m-mohr's public/protected/private as well.

> @cholmes what's the timeline for writing up a PR?

@davidraleigh - I doubt I'll get to it in the next two weeks, so if you could do it within that time frame that'd be great.

m-mohr commented 3 years ago

By the way, there are related issues for accessing and ordering data: #836 and #891

davidraleigh commented 3 years ago

@cholmes I can make an attempt at a pull request this week.

cholmes commented 3 years ago

@davidraleigh - awesome! Be warned, we are going to move most of the extensions out of the core repo soon; see #946. But feel free to make a PR here; it'll just probably be applied to another repo.

cholmes commented 3 years ago

Circling back on this - we've got a lot of great energy on the cloud storage extension. But I don't think we need it for 1.0.0, as STAC works fine without it, and it'll be a nice addition to have as an extension (I'm not set on that, but I'd want to hear a good argument).

But what do we want to actually say in the spec itself? Should we call out the use of s3://-style URLs in a best practice? And say that those are recommended when data is requester pays? And also recommend that people don't put their STAC metadata in requester-pays buckets?

cholmes commented 3 years ago

For main best practices:

cholmes commented 3 years ago

Closing this, though note that if we do get a storage extension soon, we should link to it from this best practice.