os-climate / physrisk

Physical climate risk calculation engine
Apache License 2.0

Enhance hazard data set security and facilitate download from S3 #133

Open joemoorhouse opened 1 year ago

joemoorhouse commented 1 year ago

Is your feature request related to a problem? Please describe. Currently there is no dedicated S3 bucket for hazard data. Instead we use redhat-osc-physical-landing-647521352890. We also recently introduced another bucket, physrisk-hazard-indicators, for hazard model development (see https://github.com/os-climate/os_c_data_commons/issues/273). But I think we need separate dedicated buckets for test and 'prod'. 'Prod' here means the store used by the sandbox, but even so I think we need a low risk of accidental overwriting.

Describe the solution you'd like. Could we have, for example, physrisk-hazard-indicators and physrisk-hazard-indicators-test, and use physrisk-hazard-indicators for 'prod' (or should that be physrisk-hazard-indicators-prod)? I'm not sure if there is a convention used across OS-C. Secrets are maintained here, by the way: https://console-openshift-console.apps.odh-cl1.apps.os-climate.org/k8s/ns/sandbox/secrets/physrisk-s3-keys

Separately, non-members need to be able to download bucket contents. Ideally we don't want to make buckets public, as the model is to federate or for members to host their own data (i.e. OS-C is not to be a data provider). SFTP could facilitate the latter.

joemoorhouse commented 1 year ago

Hi @redmikhail, Sorry, I really dragged my heels over this issue! This relates to what we discussed a couple of weeks ago or so. We need better control of the OS-C hazard indicator data sets. As discussed, we probably need separate and dedicated 'test' and 'prod' buckets? Not sure about naming conventions. @MichaelTiemannOSC, @MightyNerdEric and @HeatherAck also FYI

redmikhail commented 1 year ago

Hi @joemoorhouse, from the naming convention perspective, to separate infrastructure buckets we generally follow osc---bucket (example: osc-nlp-data-bucket01-s3). At the same time, since you already have a bucket created, there is no point changing the name. I would suggest keeping physrisk-hazard-indicators for "prod" and having a second bucket named physrisk-hazard-indicators-dev01 (considering that it will potentially be used for both purposes). Regarding public access: I am starting to wonder if we should have a separate bucket that is publicly available for general OS-C use, including for physical risk. It seems that we will have some raw data that we may need to share with the community outside of the OS-Climate organization. Please let me know what you think.

joemoorhouse commented 1 year ago

Hi @redmikhail, We discussed together on Friday an option:

The idea is that even non-members can easily get set up and take a copy of data using the public read-only bucket. Some care needed to ensure that users are taking their own data.

I think the option seems good, but am happy to be guided by you and @MichaelTiemannOSC in this. As long as we have a separate dev bucket to avoid accidents and users have some way to download data then I'm happy!

If you are both happy with the public readonly bucket option then please go ahead and create and then I'll transfer the data across.

Thanks, Joe

keshavnath1 commented 1 year ago

Hi @joemoorhouse , I have created public s3 bucket "arn:aws:s3:::redhat-osc-physical-landing-64759867891", let me know if you can move data to this

Thanks Keshav

joemoorhouse commented 1 year ago

Hi @keshavnath1; hi @redmikhail, Thanks for this. How do I get the credentials for writing? I tried, and they seem to be different from the credentials for the existing bucket, redhat-osc-physical-landing-647521352890?

Also, I guess we would want credentials to allow the user of redhat-osc-physical-landing-647521352890 to be able to use PutObject and CopyObject on redhat-osc-physical-landing-64759867891 - for efficient transfer between the two? Or is there another/better way to do this? I saw this for example on the subject: https://stackoverflow.com/questions/65577223/aws-s3-copy-object-from-one-bucket-to-another-with-different-credentials

Also, I thought from the above we were going with naming convention 'physrisk-hazard-indicator-...'? Thanks, Joe

samanth91 commented 1 year ago

Hi @joemoorhouse, @keshavnath1 is my teammate and asked me to look into this S3 write. Is it OK if I create an AWS IAM user and send you the credentials so that you can write to that bucket? Once you are done writing to the bucket, I will suspend the user. Let me know if this is OK. If so, can you give me an email address so that I can send the AWS credentials to it?

ryanaslett commented 1 year ago

Hi @joemoorhouse Who are @samanth91 and @keshavnath1 and what is their role? 64759867891 is not an aws account that is managed by OS Climate, so I have no idea where that bucket is.

HeatherAck commented 1 year ago

Hi @ryanaslett - both are from freddiemac.com and are new users who want to test the Physical Risk & Resilience tool - I believe they plan to test it locally in their own environment.

joemoorhouse commented 1 year ago

Hi @ryanaslett, Ah, and I assumed they were colleagues of @redmikhail! Thanks for clearing that up @HeatherAck!

So @samanth91 and @keshavnath1, the idea is that OS Climate is creating a public bucket, then you can transfer from there however you like. I believe @redmikhail is working on that. I'll give you the details once that is complete.

joemoorhouse commented 1 year ago

Hi @redmikhail, Did you have a chance to create the public bucket? I believe @samanth91 and @keshavnath1 need this to continue their work. I think you were considering creating the buckets:

os-climate-public-data (readonly; public)
physrisk-hazard-indicators-dev01 (private; dev)
physrisk-hazard-indicators (private; prod)

Although I think os-climate-public-data is the most urgent. Thanks, Joe

HeatherAck commented 1 year ago

@ryanaslett and @MightyNerdEric - could you please create the public bucket. I think @redmikhail may be on vacation this week.

redmikhail commented 1 year ago

@HeatherAck - my apologies ! I was away for July 3rd and 4th for holidays

redmikhail commented 1 year ago

Hi @joemoorhouse , all tasks should be completed now. Here are details for the configuration:

All credentials are added to the secrets physrisk-s3-keys (physrisk-hazard-indicators), physrisk-dev-s3-keys (physrisk-hazard-indicators-dev01) and physrisk-public-s3-keys (os-climate-public-data, for rw access).

To copy data to the public bucket you can use aws s3 commands with the appropriate AWS keys (from the physrisk-public-s3-keys secret), for example:

    aws s3 cp s3://physrisk-hazard-indicators/bucket_test2.txt s3://os-climate-public-data/physrisk/bucket_test2.txt

Data can be retrieved by any user with a web browser by specifying the direct URL, for example https://os-climate-public-data.s3.amazonaws.com/physrisk/bucket_test2.txt, or with curl:

    curl -L https://os-climate-public-data.s3.amazonaws.com/physrisk/bucket_test2.txt -o ./physrisk/bucket_test2.txt

or with the AWS CLI (note that aws s3 cp needs a destination path):

    aws s3 cp s3://os-climate-public-data/physrisk/bucket_test2.txt ./bucket_test2.txt --region us-east-1 --no-sign-request
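For anyone scripting the download, the anonymous retrieval above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `public_url` and `download` are made up for illustration, and it assumes the object stays publicly readable at the virtual-hosted URL shown above.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote


def public_url(bucket: str, key: str) -> str:
    """Build the virtual-hosted-style URL for a publicly readable object."""
    # quote() keeps "/" as-is by default, so key prefixes survive intact.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def download(bucket: str, key: str, dest_dir: str = ".") -> Path:
    """Fetch one public object without credentials, mirroring the key as a local path."""
    dest = Path(dest_dir) / key
    dest.parent.mkdir(parents=True, exist_ok=True)  # create ./physrisk/ etc. if missing
    urllib.request.urlretrieve(public_url(bucket, key), str(dest))
    return dest
```

Calling `download("os-climate-public-data", "physrisk/bucket_test2.txt")` would then fetch the same test file as the curl command.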

joemoorhouse commented 1 year ago

Thanks @redmikhail, that's great... I'm on vacation also hence delay in reply! I'll give the copying a go.

joemoorhouse commented 1 year ago

Hi @redmikhail, @keshavnath1, @samanth91,

Sorry for the delay, vacations got in the way. I have now copied the hazard data from the old bucket to 'physrisk-hazard-indicators'. Still to do: migrate the sandbox over to point to the new bucket.

I also copied the hazard data into 'os-climate-public-data', which is therefore now publicly accessible.

List operations on os-climate-public-data are not permitted, as @redmikhail mentioned above, which will of course make taking a copy problematic! To get around this, I've added a file hazard/keys.txt with the list of the keys comprising the hazard files. There are about 78,000 keys (45 GB) there currently. The large number comes from the chunking of the (zarr) data.

https://os-climate-public-data.s3.amazonaws.com/hazard/keys.txt

The idea is then to copy the keys in this list (e.g. using boto3 copy_object or similar), or a subset of them for demonstration purposes.
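A rough sketch of that copy, for illustration only. It assumes keys.txt holds one full object key per line (the file format is not spelled out above), that the public-read policy also lets authenticated callers use the objects as a copy source, and that the destination bucket name `my-hazard-copy` is a placeholder for your own bucket. The helper names are hypothetical; `copy_object` is the real boto3 S3 client call.

```python
import urllib.request
from typing import Iterable, List

PUBLIC_BUCKET = "os-climate-public-data"
KEYS_URL = "https://os-climate-public-data.s3.amazonaws.com/hazard/keys.txt"


def parse_keys(text: str, prefix: str = "") -> List[str]:
    """Return the non-empty lines of keys.txt, optionally filtered by key prefix."""
    keys = (line.strip() for line in text.splitlines())
    return [key for key in keys if key and key.startswith(prefix)]


def fetch_keys(url: str = KEYS_URL, prefix: str = "") -> List[str]:
    """Download the key list anonymously (listing the bucket itself is not permitted)."""
    with urllib.request.urlopen(url) as response:
        return parse_keys(response.read().decode("utf-8"), prefix)


def copy_keys(keys: Iterable[str], dest_bucket: str, s3_client) -> int:
    """Server-side copy of each key from the public bucket into dest_bucket."""
    copied = 0
    for key in keys:
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": PUBLIC_BUCKET, "Key": key},
        )
        copied += 1
    return copied


def main() -> None:
    # boto3 is third-party; imported here so the helpers above need only the stdlib.
    import boto3

    keys = fetch_keys()[:100]  # small subset for a demonstration run
    s3 = boto3.client("s3")  # uses your own credentials for the destination bucket
    print(copy_keys(keys, "my-hazard-copy", s3), "objects copied")
```

Running `main()` copies the first 100 listed keys; drop the slice to take everything. With about 78,000 small objects, a thread pool around `copy_keys` would likely speed things up considerably.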

samanth91 commented 1 year ago

@joemoorhouse Thanks for the info, we will be using those keys.