pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

feature request: missing files #374

Open ghislainp opened 2 years ago

ghislainp commented 2 years ago

I have a dataset with one file per day, but some files are missing. Is there a way to handle this case programmatically? For instance, a function similar to process_input that would be called when a file is missing, e.g. process_missing?

cisaacstern commented 2 years ago

👋 @ghislainp, thanks for this question. Are you able to determine which specific dates are missing prior to writing the recipe?

If so, you could employ a pattern like this:

https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/blob/32ba8c8f6a639975a1061ece699ac2f053cb8d02/feedstock/recipe.py#L7-L18

to drop them from the file list before the recipe is executed.

This is probably the easiest way to handle this case at the moment. Automatically skipping over missing dates during recipe execution is not currently supported, though that would certainly be worth aiming for eventually.
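As a rough illustration of the "drop them before execution" approach, here is a minimal sketch that uses only the standard library. The URL template, the date range, and the helper names (daily_dates, available_dates) are hypothetical and are not taken from the linked feedstock or from pangeo-forge-recipes; the resulting list of dates is what you would then pass on when building the recipe's file pattern.

```python
from datetime import date, timedelta

# Hypothetical URL template, for illustration only.
URL_TEMPLATE = "https://example.com/data/{day:%Y%m%d}.nc"


def daily_dates(start, end):
    """Return every date from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]


def available_dates(start, end, missing):
    """Drop the dates known to be missing from the full daily range."""
    missing = set(missing)
    return [d for d in daily_dates(start, end) if d not in missing]


# Assumed example: five days, with 2020-01-03 known to be missing upstream.
dates = available_dates(
    date(2020, 1, 1),
    date(2020, 1, 5),
    missing=[date(2020, 1, 3)],
)
urls = [URL_TEMPLATE.format(day=d) for d in dates]
```

This only works when the gaps are known before the recipe is written, and the resulting time axis has a hole wherever a date was dropped.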

ghislainp commented 2 years ago

I could, but the resulting output data would not be regular in time if some dates are skipped.

Is it possible to re-align the dataset after the concatenation, before writing the Zarr store? I assume this could be done with the process_chunk function, but the output of process_chunk would then be larger than its input, and what would happen if the missing date falls between two chunks?

cisaacstern commented 2 years ago

I see. If I understand correctly, you would ideally like arrays of NaNs (or some other filler value) in place of the empty dates, so that the dataset chunking remains correctly aligned, despite the missing dates?
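To make the idea concrete, here is a toy sketch of the desired result, not of anything pangeo-forge-recipes currently does. Each day is represented by a single float standing in for a full array, and the helper names (fill_missing_days, split_into_chunks) are hypothetical. On an in-memory xarray dataset, reindexing onto the full daily time axis would give a comparable result, since reindex fills absent labels with NaN by default.

```python
import math
from datetime import date, timedelta


def fill_missing_days(values_by_date, start, end, fill=math.nan):
    """Place values on a regular daily axis, using `fill` where a day is absent."""
    n_days = (end - start).days + 1
    axis = [start + timedelta(days=i) for i in range(n_days)]
    values = [values_by_date.get(day, fill) for day in axis]
    return axis, values


def split_into_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


# Assumed example: six days with 2020-01-03 missing, chunked three days at a time.
observed = {
    date(2020, 1, 1): 1.0,
    date(2020, 1, 2): 2.0,
    date(2020, 1, 4): 4.0,
    date(2020, 1, 5): 5.0,
    date(2020, 1, 6): 6.0,
}
axis, values = fill_missing_days(observed, date(2020, 1, 1), date(2020, 1, 6))
chunks = split_into_chunks(values, chunk_size=3)
```

Because every date keeps its slot on the regular axis, a gap at a chunk boundary only changes the contents of one chunk; the later chunks still start on the same dates they would have without the gap.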

To the best of my knowledge, this is not currently possible (at least, not without some seriously hacky maneuvers), but the ongoing work to resolve https://github.com/pangeo-forge/pangeo-forge-recipes/issues/256, which is a current priority, would probably make this much more feasible. I'll be curious to know whether @rabernat agrees with this assessment or whether I've overlooked something.

cisaacstern commented 2 years ago

Noting that https://github.com/pangeo-forge/cesm-atm-025deg-feedstock/issues/2 would benefit from a similar feature (failing gracefully in the case of missing files).