Open rabernat opened 3 years ago
@zflamig suggested we use S3 storage inventory to index our buckets: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
I recommend we set this up for our buckets.
@zflamig - since these are official public datasets, is this something you can set up on your end? Or does it need to be run on one of our projects?
@rabernat You should run it on your side. You can configure it in the accounts which contain the data. Feel free to make new buckets for storing the outputs there too.
@rabernat @zflamig I'll aim to configure this on esgf-world this week. We could then assess and apply the same configuration to cmip6-pds.
I was able to configure S3 Inventory for storage management (daily reporting for now) using these steps. Per the documentation, it can take up to 48 hours for the first parquet file to be generated; in my case, it took about a day for the parquet to reflect the state of the bucket. @rabernat, a similar configuration must be applied to the cmip6-pds bucket. How should we proceed with that?
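For reference, here is a minimal sketch of the configuration step using boto3; this is not necessarily how it was done above. The bucket, destination, and id names follow the paths quoted later in this thread, and the destination bucket separately needs a bucket policy that lets s3.amazonaws.com write reports into it.

```python
import boto3

# Sketch only: enable a daily S3 Inventory report in Parquet format.
# Names (esgf-world, cmip6-inventory, s3-inventory-dest) are taken from
# the destination paths mentioned in this thread.
s3 = boto3.client("s3")
s3.put_bucket_inventory_configuration(
    Bucket="esgf-world",
    Id="cmip6-inventory",
    InventoryConfiguration={
        "Id": "cmip6-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::s3-inventory-dest",
                "Format": "Parquet",
                "Prefix": "esgf-world",
            }
        },
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
    },
)
```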
There was a bit of confusion on my end, since there appear to be two parquet files reflecting the S3 inventory for a given timestamp. I referred to the pointers in manifest.json (for a given day) in the S3 inventory destination bucket and then looked up the parquet files specified as keys. Referring to symlink.txt under the hive prefix for pointers to the parquet files also works and seems easier. @zflamig, please let me know if there is a different recommended approach for locating the most recent parquet files. I also tried Athena, which was nice, but for my current use case I am reading the parquet into a pandas dataframe.
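A rough sketch of the manifest.json approach, assuming the bucket layout above (the prefix names come from this thread, and sorting keys works because each report lands under a timestamped prefix):

```python
import json
import boto3

# Sketch: find the most recent inventory manifest and read the parquet
# keys it points to. Ignores list pagination for brevity (list_objects_v2
# returns at most 1000 keys per call).
s3 = boto3.client("s3")
dest_bucket = "s3-inventory-dest"
prefix = "esgf-world/cmip6-inventory/"

resp = s3.list_objects_v2(Bucket=dest_bucket, Prefix=prefix)
manifests = sorted(
    obj["Key"]
    for obj in resp.get("Contents", [])
    if obj["Key"].endswith("manifest.json")
)
latest = manifests[-1]  # timestamped prefixes sort chronologically

manifest = json.loads(
    s3.get_object(Bucket=dest_bucket, Key=latest)["Body"].read()
)
# "files" lists the parquet object(s) that together form one snapshot.
parquet_keys = [f["key"] for f in manifest["files"]]
print(parquet_keys)
```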
Please refer to the notebook updates in #16.
Here is an example parquet (encrypted): s3://s3-inventory-dest/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet
aws s3 cp s3://s3-inventory-dest/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet .
Great! Can you make that url public so that we can all read it? There is nothing private in there.
Also, since you have managed to figure this out, would you mind also pointing it at the cmip6-pds bucket? Do you need extra permissions for that?
I believe I need access to the AWS account corresponding to the bucket, which I don't have for cmip6-pds.
Gotcha. @naomi-henderson, do you have those credentials? Can you share them with @aradhakrishnanGFDL?
> Great! Can you make that url public so that we can all read it? There is nothing private in there.
Agreed. Please let me know if you're unable to access the following.
Yes, it works!
This works:
import pandas as pd
url = 'https://s3-inventory-dest.s3.us-east-2.amazonaws.com/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet'
df = pd.read_parquet(url)
Also, this works:
url = 's3://s3-inventory-dest/esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet'
df = pd.read_parquet(url, storage_options=dict(anon=True))
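Since a single snapshot can span more than one parquet file (as noted above), here is a small sketch for combining them into one dataframe. The first key is the example from this thread; any sibling keys would come from the same day's manifest.json.

```python
import pandas as pd

# Sketch: read every parquet key from one inventory snapshot and
# concatenate. anon=True assumes the objects are publicly readable.
keys = [
    "esgf-world/cmip6-inventory/data/66ee042f-0b7e-4143-a6ac-8837abe1d421.parquet",
    # ...any other parquet keys listed in the same manifest
]
df = pd.concat(
    (
        pd.read_parquet(f"s3://s3-inventory-dest/{k}",
                        storage_options=dict(anon=True))
        for k in keys
    ),
    ignore_index=True,
)
print(len(df), "objects in inventory")
```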
Yes, I have the credentials and I will send them to you @aradhakrishnanGFDL
@rabernat @naomi-henderson Configured S3 Inventory on cmip6-pds. I will log back in tomorrow to see if it works!
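In case it helps, a quick way to check that the configuration registered (a sketch; this only confirms the config exists and is enabled, and the first report can still take up to 48 hours to land in the destination bucket):

```python
import boto3

# Sketch: list the inventory configurations registered on cmip6-pds.
s3 = boto3.client("s3")
resp = s3.list_bucket_inventory_configurations(Bucket="cmip6-pds")
for cfg in resp.get("InventoryConfigurationList", []):
    print(cfg["Id"], cfg["IsEnabled"], cfg["Schedule"]["Frequency"])
```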
@aradhakrishnanGFDL Fantastic!