`quantiles.py` produces `edfcsv` even though `impact` file might be entirely absent in a given climate model

emileten commented 3 years ago

Hi @brews -- I think I can't assign anyone because I haven't been invited to this repository yet. Would you mind doing it ? Thank you ! Also if you think it's better that James looks at this, feel free to redirect this issue (I wasn't sure who to assign).

I want to extract the gcm-distribution from some impact netcdf. This netcdf is contained in all but one gcm's target dirs. The current behavior of the code is that it skips empty target directories, and doesn't even print a warning - I obtain misleading numbers as a consequence.

The function that verifies the validity of the target directory and skips it if it doesn't contain the file is lib.configs.iterate_valid_targets(). The lines causing the issue start here : https://github.com/jrising/prospectus-tools/blob/4d5244267cfa5e71192a6719086c6cf3bdd58156/gcp/extract/lib/configs.py#L131

Given that the lines above are executed only if the target directory falls in the extraction request of the user, I think it is desirable that the code throws an error if one of the target directory is missing the netcdf file, or at least, that there is a boolean parameter allowing the user to have this protective behavior.

Here is just an idea that I tested :

allow for a 'ignore-empty-dirs' in the config.
In lib.configs.iterate_valid_targets(), declare ignore_empty_dirs=config.get('ignore-empty-dirs', True)

in the same function, expand the

if impact + ".nc4" in os.listdir(multipath(targetdir, impact)):
...

to

if impact + ".nc4" in os.listdir(multipath(targetdir, impact)):
...
else:
error = 'there is no ' + impact + '.nc4 file for ' + batch + ' ' +  rcp +  '  '  + model +  '  '  + iam +  '  '  + ssp
if not ignore_empty_dirs:
 raise error

And it throws an informing error.

Thank you very much ! I hope you have a great end of the week.

jrising commented 3 years ago

@brews I'm ambivalent about a solution to this problem. It does not need to be the job of quantiles.py to check whether all GCMs are accounted for. The case described seems fairly esoteric (the directory exists, but not the file?), and the more general requirement that "all" GCMs be required would have to be given the list of GCMs (different analyses aim for different "complete" sets of GCMs). But as long as the current behavior remains the default, any solution seems fine with me.

emileten commented 3 years ago

@jrising @brews thanks for reacting to this. It's indeed about a particular case, i.e outside of the code system, since the files I am trying to aggregate were not produced by the projection system itself, but by some independent code (Andy's code for costs in agriculture).

brews commented 3 years ago

@emileten I think that @jrising raises a really good point. It's generally tricky to read the user's mind and identify which GCMs are and are not missing -- especially because each sector can have its own tedious rules on which GCMs are used in combination with other IAMs, SSPs, etc. I'd lean towards marking this as something we won't be able to fix. Right now I feel like fixing it right requires too much "magic" (which will likely break in the future).

@emileten are you able to work around this if we don't fix it here?

emileten commented 3 years ago

@brews thanks, I understand. For now I already have a workaround yes.

But I wrote this issue for the future. Although my case is particular, I think similar cases might happen in the future. I see what you mean when you talk about sector's specificities. Here specificities are determined by which directories exist and which GCM-weights are used. The reason I wrote this issue is that I found it misleading that the code continues to work normally although there is a positive weight assigned to a GCM for which no file (of the type requested) exists, while the directory for that GCM exists. Maybe a less 'magical' check could happen when the weights are applied to the data loaded. If there are unmatched non-zero weights, throw an error.

jrising commented 3 years ago

@emileten I have no problem with this being added as an optional feature. I was trying to find a place in the documentation to add a line like, "The quantiles.py script produces results even for incomplete datasets." but we don't have a gentle introduction to the tool. Maybe that would be the solution here, so that the performance of the tool is not misleading?

emileten commented 3 years ago

Thanks @jrising. I agree either solutions are good (optional feature, or add this output restriction in the documentation). I am happy to help implementing either, if needed.

jrising commented 3 years ago

@emileten I believe that @brews is working on a stand-alone version of the quantiles.py (and related?) tool(s), so we can add the documentation in there.

My suggestion for a coding solution is to add a config option: require-gcms: [...], where if any of the listed GCMs is missing, we drop the set it's in. There's a challenge of defining what the set it's in is though: the data structure we build up in results.py::sum_into_data uses keys defined as the tuple (batch, GCM, IAM). So, the idea is that after we've collected all of the data, we look through unique (batch, *, IAM) combinations and drop those without the required GCMs.

I would be happy for you to submit a PR with that.

emileten commented 3 years ago

thanks @jrising, I am working on that.

jrising / prospectus-tools

`quantiles.py` produces `edfcsv` even though `impact` file might be entirely absent in a given climate model #36