pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion
Apache License 2.0

indexing with s3 storage inventory #15

Open · rabernat opened this issue 3 years ago

rabernat commented 3 years ago

@zflamig suggested we use S3 storage inventory to index our buckets: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
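
For concreteness, here is a minimal sketch (not the configuration actually used in this thread) of enabling such an inventory programmatically with boto3, assuming a source bucket esgf-world, a destination bucket s3-inventory-dest, and a daily Parquet report:

# Sketch: enable a daily Parquet S3 Inventory report on a source bucket.
# Bucket names, config ID, and prefix are illustrative placeholders.
# Note: the destination bucket also needs a policy allowing S3 to deliver reports.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_inventory_configuration(
    Bucket="esgf-world",  # source bucket to inventory
    Id="cmip6-inventory",
    InventoryConfiguration={
        "Id": "cmip6-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::s3-inventory-dest",  # destination bucket ARN
                "Prefix": "esgf-world",
                "Format": "Parquet",
            }
        },
    },
)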

I recommend we

rabernat commented 3 years ago

@zflamig - since these are official public datasets, is this something you can set up on your end? Or does it need to be run on one of our projects?

zflamig commented 3 years ago

@rabernat You should run it on your side. You can configure it in the accounts which contain the data. Feel free to make new buckets for storing the outputs there too.

aradhakrishnanGFDL commented 3 years ago

@rabernat @zflamig I'll aim to configure this on esgf-world this week; we could then assess and apply the same configuration to cmip6-pds.

aradhakrishnanGFDL commented 3 years ago

I was able to configure S3 Inventory for storage management (daily reporting for now) using these steps. Per the documentation, it can take up to 48 hours for the first parquet report to be generated; in my case it took about a day for the parquet to reflect the state of the bucket. @rabernat similar configuration needs to be applied to cmip6-pds. How should we proceed with that?

There was a bit of confusion on my end, since there appear to be two parquet files reflecting the S3 inventory for a given timestamp. I read the manifest.json for a given day from the S3 inventory destination bucket and then followed the keys it lists to the parquet files. Referring to symlink.txt under hive/ for pointers to the parquet files also works and seems easier. @zflamig, please let me know if there is a different recommended approach for locating the most recent parquet files. I also tried Athena, which was nice, but for my current use case I am reading the parquet into a pandas dataframe.
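
As a rough illustration of the manifest.json route described above, something along these lines could resolve the parquet keys for one delivery. The manifest path shown is an assumption about the layout, and reading s3:// paths requires s3fs plus credentials for the destination bucket:

# Sketch: resolve inventory parquet keys from a delivery's manifest.json,
# then read them with pandas. Bucket name and manifest path are illustrative.
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")
dest_bucket = "s3-inventory-dest"
manifest_key = "esgf-world/cmip6-inventory/2021-03-30T00-00Z/manifest.json"  # assumed layout

manifest = json.loads(
    s3.get_object(Bucket=dest_bucket, Key=manifest_key)["Body"].read()
)
parquet_keys = [f["key"] for f in manifest["files"]]
df = pd.concat(
    [pd.read_parquet(f"s3://{dest_bucket}/{k}") for k in parquet_keys]
)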

Please refer to the notebook updates in #16.

Here is an example parquet (encrypted): s3://s3-inventory-dest/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet

aws s3 cp .

rabernat commented 3 years ago

Great! Can you make that url public so that we can all read it? There is nothing private in there.

rabernat commented 3 years ago

Also, since you have managed to figure this out, would you mind also pointing it at that cmip6-pds bucket? Do you need extra permissions for that?

aradhakrishnanGFDL commented 3 years ago

I believe I need access to the AWS account corresponding to the bucket, which I don't have for cmip6-pds.

rabernat commented 3 years ago

Gotcha. @naomi-henderson, do you have those credentials? Can you share them with @aradhakrishnanGFDL?

aradhakrishnanGFDL commented 3 years ago

> Great! Can you make that url public so that we can all read it? There is nothing private in there.

Agreed. Please let me know if you're unable to access the following.

https://s3-inventory-dest.s3.us-east-2.amazonaws.com/esgf-world/cmip6-inventory/hive/dt%3D2021-03-30-00-00/symlink.txt

https://s3-inventory-dest.s3.us-east-2.amazonaws.com/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet

https://s3-inventory-dest.s3.us-east-2.amazonaws.com/esgf-world/cmip6-inventory/data/8915d082-3c0f-48ad-8300-9c30335f09a3.parquet
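
For reference, a small sketch of the symlink.txt route using the partition linked above (requires s3fs; picking the most recent dt= partition automatically would need listing the hive/ prefix):

# Sketch: read the hive symlink.txt for a given dt= partition to find the
# inventory parquet files, then load them with pandas.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # the destination bucket is publicly readable
symlink = "s3-inventory-dest/esgf-world/cmip6-inventory/hive/dt=2021-03-30-00-00/symlink.txt"

with fs.open(symlink, "r") as f:
    parquet_urls = [line.strip() for line in f if line.strip()]  # one s3:// URI per line

df = pd.concat(
    [pd.read_parquet(u, storage_options={"anon": True}) for u in parquet_urls]
)
print(len(df), "objects in inventory")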

rabernat commented 3 years ago

Yes, it works!

This works:

import pandas as pd
url = 'https://s3-inventory-dest.s3.us-east-2.amazonaws.com/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet'
df = pd.read_parquet(url)

Also this works (anonymous S3 access; requires s3fs):

url = 's3://s3-inventory-dest/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet'
df = pd.read_parquet(url, storage_options=dict(anon=True))

naomi-henderson commented 3 years ago

Yes, I have the credentials and I will send them to you @aradhakrishnanGFDL

aradhakrishnanGFDL commented 3 years ago

@rabernat @naomi-henderson I've configured S3 Inventory on cmip6-pds. I will log back in tomorrow to see if it works!
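
One possible sanity check once the first delivery lands, assuming a destination bucket and prefix analogous to the esgf-world setup (the names below are placeholders, not the actual configuration):

# Sketch: check whether the first inventory delivery for cmip6-pds has appeared.
# The destination bucket and prefix stand in for whatever was configured.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="s3-inventory-dest",
    Prefix="cmip6-pds/cmip6-inventory/hive/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["LastModified"])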

naomi-henderson commented 3 years ago

@aradhakrishnanGFDL Fantastic!