pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Estimate recipe size #136

Open rabernat opened 3 years ago

rabernat commented 3 years ago

It would be very useful to get an estimate of the total size of the target dataset produced by a recipe in GB / TB. For example, this information could be used by bakery managers to decide whether to accept a dataset into their storage.

Here are some different ways we could do this without actually running the whole recipe.

  1. Create a test version of the recipe (see #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).
  2. Go through each file in the recipe's FilePattern and inspect its size. Sum the sizes to get an estimate. This only works for static file inputs (not APIs like OPeNDAP), and it may not accurately reflect the target size if there is a lot of processing involved.
  3. Randomly sample files from the FilePattern and scale up.
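
Options 2 and 3 above could be sketched roughly as follows. This is only an illustration under stated assumptions: `estimate_input_bytes` and the `get_size` callable are hypothetical names, not part of pangeo-forge-recipes, and the URLs are made up. In practice `get_size` would need to query the remote store for each file's size (for example through a filesystem library), which is left out here.

```python
import random
from typing import Callable, Optional, Sequence


def estimate_input_bytes(
    urls: Sequence[str],
    get_size: Callable[[str], int],  # hypothetical: returns a file's size in bytes
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> float:
    """Estimate the total size in bytes of all input files.

    With sample_size=None every file is inspected and summed (option 2).
    Otherwise a random sample is inspected and its mean size is scaled
    up by the total number of files (option 3).
    """
    if not urls:
        return 0.0
    if sample_size is None or sample_size >= len(urls):
        return float(sum(get_size(u) for u in urls))
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    sampled = rng.sample(list(urls), sample_size)
    mean_size = sum(get_size(u) for u in sampled) / sample_size
    return mean_size * len(urls)


# Made-up inputs: ten files of 1 MB, 2 MB, ..., 10 MB.
fake_sizes = {
    f"https://example.com/file_{i}.nc": 1_000_000 * (i + 1) for i in range(10)
}
full_estimate = estimate_input_bytes(list(fake_sizes), fake_sizes.__getitem__)
sampled_estimate = estimate_input_bytes(
    list(fake_sizes), fake_sizes.__getitem__, sample_size=3
)
```

Either number describes the inputs, not the target, so the caveat in option 2 about processing still applies.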

cisaacstern commented 3 years ago

> 1. Create a test version of the recipe (see #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).

Are there known reasons why this is not the obvious best direction to pursue? It seems to dovetail nicely with other objectives, and should be relatively accurate, assuming the as-yet-unimplemented prune method referenced in https://github.com/pangeo-forge/staged-recipes/pull/28#issuecomment-829482555 is "prune factor"-aware.
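
For concreteness, the scale-up step could look something like the sketch below. It assumes the pruning factor can be expressed as the number of inputs the test recipe pulled divided by the number of inputs in the full recipe; `estimate_target_bytes` and `format_size` are hypothetical helpers, not an existing API, and the numbers are invented.

```python
def estimate_target_bytes(
    test_target_bytes: int, pruned_inputs: int, total_inputs: int
) -> float:
    """Scale the size of a test target up to the full recipe.

    Assumes the pruning factor is pruned_inputs / total_inputs and that
    target size grows roughly linearly with the number of inputs.
    """
    if not 0 < pruned_inputs <= total_inputs:
        raise ValueError("need 0 < pruned_inputs <= total_inputs")
    return test_target_bytes * total_inputs / pruned_inputs


def format_size(n_bytes: float) -> str:
    """Format a byte count in decimal GB or TB."""
    if n_bytes >= 1e12:
        return f"{n_bytes / 1e12:.1f} TB"
    return f"{n_bytes / 1e9:.1f} GB"


# Invented example: the test recipe pulled 2 of 200 inputs
# and produced a 1.5 GB test target.
estimate = estimate_target_bytes(1_500_000_000, pruned_inputs=2, total_inputs=200)
print(format_size(estimate))
```

The linear assumption is the weak point: fixed metadata overhead and compression ratios that differ between inputs would both skew the result.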