pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Composite generation fails if Scene DataArrays are in memory #2019

Open ghiggi opened 2 years ago

ghiggi commented 2 years ago

Describe the bug Functions in /modifiers/angles.py currently fail if the input xr.DataArray data is a numpy.ndarray instead of a dask.Array. This is because many /modifiers/angles.py functions retrieve chunks from their inputs via data.chunks, data_arr.chunks, or data_arr.data.chunks.

If it is fine for these low-level functions to be restricted to dask.Array inputs (checks are missing ...), then conversion to dask.Array should at least be imposed at a higher level, e.g. when assigning a DataArray to a Scene, to ensure that downstream computations (e.g. composite generation) do not fail when Scene DataArrays are in memory. See a typical failing example below:

To Reproduce

import glob
import os

from satpy import Scene

fname = "/OR_ABI-L1b-RadF-M6C*_G16_s20201871100222_*_*.nc"  # your fpath
fpattern = os.path.join(dirpath, fname)  # dirpath: your data directory
fpaths = glob.glob(fpattern)

scn = Scene(reader='abi_l1b', filenames=fpaths)
selected_channels = ["C01", "C02", "C03"]
scn.load(selected_channels, calibration="reflectance")         # does not matter
new_scn = scn.resample(scn.finest_area(), resampler='native')  # does not matter
new_scn = new_scn[slice(10000, 10200), slice(10000, 10200)]    # small to speed up
new_scn['C01'] = new_scn['C01'].compute()  # DataArray is numpy-backed hereafter
new_scn.load(['true_color'])               # THIS FAILS

Expected behavior Composite generation should also work if the Scene DataArrays hold their data in memory.

Actual results AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
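The underlying failure can be reproduced without satpy at all; a minimal sketch (assuming only dask and xarray) showing why code that reads data_arr.data.chunks breaks once a DataArray is computed:

```python
import dask.array as da
import xarray as xr

arr = xr.DataArray(da.zeros((4, 4), chunks=2))
print(arr.data.chunks)   # ((2, 2), (2, 2)) — dask-backed, chunks are available

arr_np = arr.compute()   # now backed by a plain numpy.ndarray
print(arr_np.chunks)     # None — xarray's own accessor degrades gracefully
try:
    arr_np.data.chunks   # but the raw numpy array has no .chunks attribute
except AttributeError as err:
    print(err)           # 'numpy.ndarray' object has no attribute 'chunks'
```

xarray's DataArray.chunks property returns None for numpy-backed data, which is why only code paths that reach through .data into the raw array hit this AttributeError.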

Environment Info:

djhoese commented 2 years ago

Satpy does not support numpy arrays or DataArrays backed by numpy arrays. It supports DataArrays backed by dask arrays only. If you'd like to achieve an almost-equivalent operation to what you are doing, use .persist() instead of .compute().
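The difference between the two suggested calls can be sketched with plain dask and xarray: .compute() replaces the dask array with a numpy array, while .persist() materializes the chunks in memory but keeps the dask wrapper (and therefore .chunks) intact.

```python
import numpy as np
import dask.array as da
import xarray as xr

arr = xr.DataArray(da.ones((4, 4), chunks=2))

computed = arr.compute()   # numpy-backed: dask-only code paths will break
persisted = arr.persist()  # data is in memory, but still a chunked dask array

print(type(computed.data))       # <class 'numpy.ndarray'>
print(type(persisted.data))      # dask Array — .chunks is still available
print(persisted.data.chunks)     # ((2, 2), (2, 2))
```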

ghiggi commented 2 years ago

Would it not make sense, in https://github.com/pytroll/satpy/blob/367d51d849fa117b1b2241d883ffc61286ee31ba/satpy/scene.py#L756, to check that the value is an xr.DataArray backed by a da.Array?

Like adding the following line of code:

if not isinstance(value.data, da.Array):
    value.data = da.from_array(value.data)

For dask newbies and xarray users, the compute method is better known than persist ...
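The proposed check can be fleshed out into a self-contained sketch (the helper name _ensure_dask_backed is hypothetical, not satpy API; copying first avoids mutating the caller's object):

```python
import numpy as np
import dask.array as da
import xarray as xr

def _ensure_dask_backed(value):
    """Hypothetical helper: coerce a numpy-backed DataArray to dask (sketch)."""
    if isinstance(value, xr.DataArray) and not isinstance(value.data, da.Array):
        value = value.copy()
        # chunks="auto" lets dask pick chunk sizes from the default config
        value.data = da.from_array(value.data, chunks="auto")
    return value

arr = _ensure_dask_backed(xr.DataArray(np.zeros((4, 4))))
print(type(arr.data))   # dask Array — downstream .chunks accesses now succeed
```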

djhoese commented 2 years ago

Sure. Yes. It could be added. Is it there now? No. Was it something we were concerned about when we were building Satpy originally? Not at the time. Would I accept a pull request to work around this? YES!

However, I'm wondering if there are other options besides using da.from_array that we should consider? An error message? Or is da.from_array "good enough". Any thoughts @pytroll/core?

For dask newbies and xarray users, the compute method is better known than persist ...

I don't disagree with that and I wasn't trying to say "you should have known about this". I was saying as a workaround for the issue you have brought up, you could use persist instead of compute.

ghiggi commented 2 years ago

Ah! While we are on the subject ... another possible small improvement I noticed for the getter/setter methods: change the Scene.__getitem__ or DatasetDict.__getitem__ method so that the returned DataArray has a name equal to the key (if the key is a string) or to key.name (if the key is a DataID).

djhoese commented 2 years ago

I may be misunderstanding what you're going for, but the reason DataIDs are used is that string names alone are not enough to uniquely describe products in Satpy (ex. Band 1 at 250m spatial resolution, 500m, or 1km; each one is a separate DataID). Additionally, modifying inputs in __setitem__/__getitem__ is going to make debugging problems very difficult. It would also require copying .attrs so as not to modify the original, which would be another source of confusion for users.

I'm starting to lean towards either:

  1. Short term "solution": Raise an exception (ValueError) if the __setitem__ value is not a DataArray[dask] object. This could optionally suggest a utility function for converting to DataArray[dask].
  2. Long term solution: Add checks/interfaces to composites, resampling, enhancements, and writers that either raise an exception or do a temporary conversion. The conversion would be to a DataArray[dask] object and then back to the original data type. However, this is a huge change to the intended use and support of data in Satpy. Pyresample has goals of these types of conversions, but pyresample also has existing support for numpy arrays. Satpy has never said it would or should work with anything other than DataArray[dask] objects. Supporting this would be a major maintenance burden on the library.
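The short-term option can be sketched as a strict container; StrictDatasetDict is a hypothetical stand-in for satpy's DatasetDict, not actual satpy code:

```python
import dask.array as da
import xarray as xr

class StrictDatasetDict(dict):
    """Hypothetical sketch of option 1: reject non-dask-backed values on assignment."""

    def __setitem__(self, key, value):
        if not (isinstance(value, xr.DataArray)
                and isinstance(value.data, da.Array)):
            raise ValueError(
                f"{key!r} must be an xr.DataArray backed by a dask array; "
                "wrap the data with dask.array.from_array, or use "
                ".persist() instead of .compute()"
            )
        super().__setitem__(key, value)

d = StrictDatasetDict()
d["C01"] = xr.DataArray(da.zeros((2, 2), chunks=1))  # accepted
# d["C02"] = xr.DataArray(np.zeros((2, 2)))          # would raise ValueError
```

Raising at assignment time moves the failure from deep inside composite generation to the line the user actually wrote, and the message can point at the .persist() workaround directly.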