icesat2py / icepyx

Python tools for obtaining and working with ICESat-2 data
https://icepyx.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

returning 'nan' when too many variables requested? #620

Closed derekpickell closed 2 weeks ago

derekpickell commented 1 month ago

Hi there,

I'm playing around with a basic read of locally downloaded .h5 files:

import icepyx as ipx

reader = ipx.Read(data_source=file_list)  # file_list: paths to the local .h5 files
reader.vars.append(beam_list=["gt2l", "gt2r"], var_list=["h_li", "latitude", "longitude", "h_li_sigma", "sigma_geo_h", "bsnow_h", "cloud_flg_asr", "atl06_quality_summary"])
ds = reader.load()

It seems that when I add just one more variable to the list, e.g., 'h_rms_misfit', the number of NaNs in the returned xarray dataset 'ds' increases for no apparent reason, sometimes for all variables.

icepyx v1.3.0

Thank you!

JessicaS11 commented 3 weeks ago

Hello @derekpickell! I just wanted to acknowledge I saw this post (thanks for reaching out!) and am wondering why this is unexpected behavior? It could be that adding h_rms_misfit is increasing one of the dataset dimensions, which would tend to increase the number of nans as Xarray pads out the data to take this new shape.
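A minimal sketch of the padding behavior described above, using synthetic data rather than real ICESat-2 files. The variable names are borrowed from this thread, but the values, the single `delta_time` dimension, and the explicit `join="outer"` are assumptions made for illustration; this is not how icepyx performs its merge internally.

```python
import xarray as xr

# Two hypothetical per-group datasets whose shared dimension only partly overlaps.
a = xr.Dataset(
    {"h_li": ("delta_time", [10.0, 11.0, 12.0])},
    coords={"delta_time": [0, 1, 2]},
)
b = xr.Dataset(
    {"h_rms_misfit": ("delta_time", [0.1, 0.2])},
    coords={"delta_time": [2, 3]},
)

# An outer join keeps the union of coordinate labels and fills the gaps with NaN.
merged = xr.merge([a, b], join="outer")

# Count the NaNs that the merge introduced in each variable.
nan_counts = {name: int(merged[name].isnull().sum()) for name in merged.data_vars}
print(merged.sizes["delta_time"], nan_counts)
```

Neither input contains a NaN, yet after the merge `delta_time` has four labels, `h_li` has one NaN, and `h_rms_misfit` has two, simply because each variable is padded out to the combined shape.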

A few questions that will make it easier for me to diagnose if there's an issue:

derekpickell commented 3 weeks ago

Hi @JessicaS11,

Thank you for the response! To answer your questions:

JessicaS11 commented 3 weeks ago

Thanks for these answers. I've dug in a bit more and now suspect that it is not the number of variables you're playing with, but which variables. The note on which ones you've experimented with was a clue. h_rms_misfit, bsnow_h, and cloud_flg_asr are all more deeply nested variables than (for instance) h_li: if you look at the variable paths, they have either geophysical or fit_statistics after the land_ice_segments layer. If you look at the resulting dataset for a single file after reading in two versus three of the above specific variables, the coordinates attached to the variables are different. What's happening behind the scenes is that icepyx does all of the individual group reads with xarray and then tries to cleverly merge the per-group DataArrays together into one dataset. As you've noted, this doesn't always work! Handling (generically) the multiple layers of nesting is an ongoing challenge in icepyx, so thanks for reporting this case we missed.
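One way to compare the coordinates attached to each variable is sketched below. `describe_variables` is a hypothetical helper (not part of icepyx), and the dataset is a synthetic stand-in with made-up dimensions, so treat it as an illustration of the check rather than a reproduction of the bug.

```python
import numpy as np
import xarray as xr


def describe_variables(ds):
    """Return {variable name: (dims, sorted coordinate names)} for each data variable."""
    return {
        name: (da.dims, tuple(sorted(da.coords)))
        for name, da in ds.data_vars.items()
    }


# Synthetic stand-in: one variable carries both dimensions, the other only one.
ds = xr.Dataset(
    {
        "h_li": (("spot", "delta_time"), np.zeros((2, 3))),
        "bsnow_h": (("delta_time",), np.zeros(3)),
    },
    coords={"spot": [3, 4], "delta_time": [0, 1, 2]},
)

info = describe_variables(ds)
for name, (dims, coords) in info.items():
    print(name, dims, coords)
```

Running the same helper on the output of `reader.load()` with two versus three of the nested variables requested should make any difference in attached dimensions or coordinates easy to spot.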

I think I've isolated where in the code the issue is happening (lines 816-822 or so in the read module, so could also be in one of the functions called therein), but I haven't yet figured out what the solution might be (any suggestions welcome!). I'll continue to work on resolving this as time allows, but any assistance would be greatly appreciated.

JessicaS11 commented 2 weeks ago

Hello @derekpickell! I have good news and bad news. Good news is the bug I identified where all dimensions were not being applied to the deeper nested variables of interest is fixed via #623. Bad news is I don't think this was actually the problem you noted.

When I dug in further, I found a granule that only has nan values for some variables. However, it seems like only bsnow_h fits into this category, not cloud_flg_asr or h_rms_misfit. If I'm not mistaken, in some situations the blowing snow algorithm is unable to confidently quantify blowing snow, which would result in no blowing snow values. @mikala-nsidc (ICESat-2 support specialist at NSIDC) or @tsutterley (one of the ATL06 product leads), can you confirm that in some cases no bsnow_h (and thus all nans) is expected behavior for ATL06 granules?
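A quick way to check for variables that contain only NaN values is sketched here. `all_nan_variables` is a hypothetical helper and the dataset is synthetic; with real data the same function could be applied to the dataset loaded from a single granule.

```python
import numpy as np
import xarray as xr


def all_nan_variables(ds):
    """Return the names of data variables in which every value is NaN."""
    return [name for name, da in ds.data_vars.items() if bool(da.isnull().all())]


# Synthetic example: h_li has one gap, bsnow_h has no valid values at all.
ds = xr.Dataset(
    {
        "h_li": ("delta_time", [1.0, np.nan, 3.0]),
        "bsnow_h": ("delta_time", [np.nan, np.nan, np.nan]),
    }
)

empty_vars = all_nan_variables(ds)
print(empty_vars)
```

Here only `bsnow_h` is reported, which mirrors the case where a variable is empty in the source granule rather than being emptied by the read.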

derekpickell commented 2 weeks ago

@JessicaS11 wow amazing thank you. It looks like everything 'makes sense' with the data I am looking at: few nans here and there, but no large gaps where I wouldn't expect them.

JessicaS11 commented 2 weeks ago

@derekpickell Excellent! I'm going to close this issue as resolved, but feel free to comment again if need be. Would you be able/willing to do a PR review for #623?