pangeo-data / swot_adac_ogcms

Documentation and notebooks for the SWOT Adopt-a-Crossover Model Intercomparison
Apache License 2.0

Description of the Google Cloud scratch bucket #13

Closed roxyboy closed 2 years ago

roxyboy commented 2 years ago

A PR to add more detail to the README about the intermediate results stored in the Google Cloud bucket, noting that they are not publicly available.

review-notebook-app[bot] commented 2 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

roxyboy commented 2 years ago

@yangleir Would the edits to README in this PR be helpful for you?

yangleir commented 2 years ago

Thanks very much! I have some questions about the scratch storage:

roxyboy commented 2 years ago
  • To fully use these notebooks, other users must run some code to produce the files that you saved in the scratch storage on GCP. How can we do that? Is this code available?

Yes, all of the code used to produce the intermediate outputs stored on the scratch storage is (should be) available in the Jupyter notebooks in this GitHub repository.

  • Why not use the cat in all examples, and just produce the diagnostic files with these notebooks? Is it because the diagnostic computation is heavy and the diagnostic files are too huge?

Yes to your last question. Sometimes intermediate steps are computationally heavy and I wanted to avoid re-running the same code multiple times.
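
To make that concrete for other readers, the pattern being described is roughly: open the dataset through the catalog, compute the heavy diagnostic once, cache it to the scratch bucket as Zarr, and let later notebooks read the cached result back instead of recomputing it. The sketch below illustrates that compute-once-then-cache idea; the catalog URL, entry name, bucket path, variable names, and the diagnostic itself are placeholders rather than the actual ones used in this repository (and the scratch bucket itself is not publicly writable).

```python
import intake
import xarray as xr

# Hypothetical catalog URL and entry name -- the real ones live in this repo's notebooks.
cat = intake.open_catalog("https://example.org/swot_adac_catalog.yaml")
ds = cat["some_model_surface_fields"].to_dask()

# Placeholder for a computationally heavy diagnostic (here: relative vorticity),
# assuming the dataset has "u" and "v" variables with "x" and "y" coordinates.
vort = (ds["v"].differentiate("x") - ds["u"].differentiate("y")).rename("relative_vorticity")

# Cache the intermediate result to the (non-public) GCS scratch bucket once...
scratch = "gs://some-scratch-bucket/swot_adac/relative_vorticity.zarr"  # placeholder path
vort.to_dataset().to_zarr(scratch, mode="w")

# ...so later notebooks can simply read it back instead of recomputing it.
vort = xr.open_zarr(scratch)["relative_vorticity"]
```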

yangleir commented 2 years ago

I read your nice paper more carefully to understand your idea more clearly. You said:

the scratch storage on GCP where we have saved our diagnostic outputs entailed storage and egress fees. The cost of GCP resources for JupyterHub with parallelized computation added up to roughly EUR 1000 per month for this study with the maximum computational resources of 64 cores and 256 Gb of memory per user; the resources scale on demand. As of writing, we have consumed 3.5 tera-hours of CPU and 92.1 Tb of RAM monthly on average

That is really heavy computation, and it is too demanding even for the Pangeo cloud platform itself (right?). So, does that mean that if I want to reproduce all your figures, I need to pay GCP fees of about EUR 1000 per month? That is a bit expensive to pay personally :)

roxyboy commented 2 years ago

That is really heavy computation, and it is too demanding even for the Pangeo cloud platform itself (right?). So, does that mean that if I want to reproduce all your figures, I need to pay GCP fees of about EUR 1000 per month? That is a bit expensive to pay personally :)

You can access the data on OSN from your local laptop or cluster, so if you really want to reproduce every single result in the paper, you could just work that way. The GCP fees were incurred when I executed the analyses on our cloud-based JupyterHub. There is nothing stopping you from setting up a JupyterHub in your local environment and running the notebooks yourself there.
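
For anyone who wants to try that local route, reading the OSN-hosted data from a laptop might look like the sketch below. The endpoint URL, bucket, and Zarr store path are illustrative placeholders (the repository's catalog and README have the real ones), and anonymous read access is an assumption based on the free-of-charge statement later in this thread.

```python
import s3fs
import xarray as xr

# OSN exposes an S3-compatible API; the endpoint and path below are placeholders.
fs = s3fs.S3FileSystem(
    anon=True,  # assumed: public, anonymous read access
    client_kwargs={"endpoint_url": "https://some-osn-endpoint.example.org"},
)

# Map the (placeholder) Zarr store and open it lazily with xarray.
store = fs.get_mapper("some-osn-bucket/swot_adac/some_model.zarr")
ds = xr.open_zarr(store, consolidated=True)  # drop consolidated=True if metadata isn't consolidated
print(ds)
```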

yangleir commented 2 years ago
  • To fully use these notebooks, other users must run some code to produce the files that you saved in the scratch storage on GCP. How can we do that? Is this code available?

Yes, all of the code used to produce the intermediate outputs stored on the scratch storage is (should be) available in the Jupyter notebooks in this GitHub repository.

  • Why not use the cat in all examples, and just produce the diagnostic files with these notebooks? Is it because the diagnostic computation is heavy and the diagnostic files are too huge?

Yes to your last question. Sometimes intermediate steps are computationally heavy and I wanted to avoid re-running the same code multiple times.

Many thanks to you! I have no more questions at present. Your edit of the README is great.

roxyboy commented 2 years ago

You can access the data on OSN from your local laptop or cluster, so if you really want to reproduce every single result in the paper, you could just work that way. The GCP fees were incurred when I executed the analyses on our cloud-based JupyterHub. There is nothing stopping you from setting up a JupyterHub in your local environment and running the notebooks yourself there.

Accessing the data on OSN from your local environment is free of charge as of now.

yangleir commented 2 years ago

You can access the data on OSN from your local laptop or cluster, so if you really want to reproduce every single result in the paper, you could just work that way. The GCP fees were incurred when I executed the analyses on our cloud-based JupyterHub. There is nothing stopping you from setting up a JupyterHub in your local environment and running the notebooks yourself there.

Accessing the data on OSN from your local environment is free of charge as of now.

Great! I will keep learning. Thank you.