gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License

stackstac 0.4.3 to 0.4.4 creating an issue #206

Closed: Berhinj closed this issue 1 year ago

Berhinj commented 1 year ago

To be 100% honest, I don't exactly know why I'm getting this error.

So I have this code, which basically fetches some HLS data from the CMR STAC catalog. It works under stackstac 0.4.3 but not under 0.4.4.

It seemed to be related to the numpy version at first, but I tried numpy 1.23.5, 1.24.2, and 1.25.2 and got the same error, so I'm afraid the issue somehow comes from stackstac. Can someone help me?

# FOR MORE INFO ON THE COLLECTIONS https://lpdaac.usgs.gov/products/eco_l2_lstev002/

# HLSL30.v2.0,HLSS30.v2.0
import os
from netrc import netrc  # needed for the Earthdata credentials lookup below

import pystac_client
import stackstac

# First, set up access/credentials for CMR
urs = 'urs.earthdata.nasa.gov'  # Earthdata URL endpoint for authentication
netrc_name = ".netrc"
netrcDir = os.path.expanduser(f"~/{netrc_name}")
_ = netrc(netrcDir).authenticators(urs)[0]

env = dict(
    GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR', 
    # AWS_NO_SIGN_REQUEST='YES',
    # GDAL_MAX_RAW_BLOCK_CACHE_SIZE='200000000',
    # GDAL_SWATH_SIZE='200000000',
    # VSI_CURL_CACHE_SIZE='200000000',
    GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt')
)
os.environ.update(env)

# now let's fetch some data
href = 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD'
c = pystac_client.Client.open(href)

bbox = [-90.47116362,  33.50472389, -90.37408012,  33.54915847]
params = dict(
    collections=["HLSL30.v2.0"],
    bbox=bbox,
    datetime='2023-06-01/2023-06-15',  # '2023-01-23/2023-05-23'
)

s = c.search(**params)
_ = s.matched()

# stack = stackstac.stack(s.item_collection(), epsg=32613)
xs = s.item_collection()

stac_img = stackstac.stack(xs, assets=["B03"], resolution=30, epsg=32615)
stac_img[0, 0, :1000, :1000].plot.imshow()

which gives me the following error

---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[15], line 24
     21 xs = s.item_collection()
     23 stac_img = stackstac.stack(xs, assets=["B03"], resolution=30, epsg=32615)#, snap_bounds=False, xy_coords="topleft")
---> 24 stac_img[0, 0, :100, :100].plot.imshow()

File ~/conda/envs/rxr3/lib/python3.11/site-packages/xarray/plot/plot.py:1308, in _plot2d.<locals>.plotmethod(_PlotMethods_obj, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, colors, center, robust, extend, levels, infer_intervals, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
   1306 for arg in ["_PlotMethods_obj", "newplotfunc", "kwargs"]:
   1307     del allargs[arg]
-> 1308 return newplotfunc(**allargs)

File ~/conda/envs/rxr3/lib/python3.11/site-packages/xarray/plot/plot.py:1172, in _plot2d.<locals>.newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
   1169 yval = yval.to_numpy()
   1171 # Pass the data as a masked ndarray too
-> 1172 zval = darray.to_masked_array(copy=False)
   1174 # Replace pd.Intervals if contained in xval or yval.
   1175 xplt, xlab_extra = _resolve_intervals_2dplot(xval, plotfunc.__name__)

File ~/conda/envs/rxr3/lib/python3.11/site-packages/xarray/core/dataarray.py:2811, in DataArray.to_masked_array(self, copy)
   2797 def to_masked_array(self, copy: bool = True) -> np.ma.MaskedArray:
   2798     """Convert this array into a numpy.ma.MaskedArray
   2799 
   2800     Parameters
   (...)
   2809         Masked where invalid values (nan or inf) occur.
   2810     """
-> 2811     values = self.to_numpy()  # only compute lazy arrays once
   2812     isnull = pd.isnull(values)
   2813     return np.ma.MaskedArray(data=values, mask=isnull, copy=copy)

File ~/conda/envs/rxr3/lib/python3.11/site-packages/xarray/core/dataarray.py:658, in DataArray.to_numpy(self)
    647 def to_numpy(self) -> np.ndarray:
    648     """
    649     Coerces wrapped data to numpy and returns a numpy.ndarray.
    650 
   (...)
    656     DataArray.data
    657     """
--> 658     return self.variable.to_numpy()

File ~/conda/envs/rxr3/lib/python3.11/site-packages/xarray/core/variable.py:1104, in Variable.to_numpy(self)
   1102 # TODO first attempt to call .to_numpy() once some libraries implement it
   1103 if isinstance(data, dask_array_type):
-> 1104     data = data.compute()
   1105 if isinstance(data, cupy_array_type):
   1106     data = data.get()

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/base.py:315, in DaskMethodsMixin.compute(self, **kwargs)
    291 def compute(self, **kwargs):
    292     """Compute this dask collection
    293 
    294     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    313     dask.base.compute
    314     """
--> 315     (result,) = compute(self, traverse=False, **kwargs)
    316     return result

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/base.py:600, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    597     keys.append(x.__dask_keys__())
    598     postcomputes.append(x.__dask_postcompute__())
--> 600 results = schedule(dsk, keys, **kwargs)
    601 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/threaded.py:89, in get(dsk, keys, cache, num_workers, pool, **kwargs)
     86     elif isinstance(pool, multiprocessing.pool.Pool):
     87         pool = MultiprocessingPoolExecutor(pool)
---> 89 results = get_async(
     90     pool.submit,
     91     pool._max_workers,
     92     dsk,
     93     keys,
     94     cache=cache,
     95     get_id=_thread_get_id,
     96     pack_exception=pack_exception,
     97     **kwargs,
     98 )
    100 # Cleanup pools associated to dead threads
    101 with pools_lock:

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/local.py:511, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    509         _execute_task(task, data)  # Re-execute locally
    510     else:
--> 511         raise_exception(exc, tb)
    512 res, worker_id = loads(res_info)
    513 state["cache"][key] = res

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/local.py:319, in reraise(exc, tb)
    317 if exc.__traceback__ is not tb:
    318     raise exc.with_traceback(tb)
--> 319 raise exc

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/local.py:224, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    222 try:
    223     task, data = loads(task_info)
--> 224     result = _execute_task(task, data)
    225     id = get_id()
    226     result = dumps((result, id))

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/core.py:119, in <genexpr>(.0)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/optimization.py:990, in SubgraphCallable.__call__(self, *args)
    988 if not len(args) == len(self.inkeys):
    989     raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 990 return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/core.py:149, in get(dsk, out, cache)
    147 for key in toposort(dsk):
    148     task = dsk[key]
--> 149     result = _execute_task(task, cache)
    150     cache[key] = result
    151 result = _execute_task(out, cache)

File ~/conda/envs/rxr3/lib/python3.11/site-packages/dask/core.py:119, in _execute_task(arg, cache, dsk)
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File ~/conda/envs/rxr3/lib/python3.11/site-packages/stackstac/to_dask.py:185, in fetch_raster_window(reader_table, slices, dtype, fill_value)
    178 # Only read if the window we're fetching actually overlaps with the asset
    179 if windows.intersect(current_window, asset_window):
    180     # NOTE: when there are multiple assets, we _could_ parallelize these reads with our own threadpool.
    181     # However, that would probably increase memory usage, since the internal, thread-local GDAL datasets
    182     # would end up copied to even more threads.
    183 
    184     # TODO when the Reader won't be rescaling, support passing `output` to avoid the copy?
--> 185     data = reader.read(current_window)
    187     if all_empty:
    188         # Turn `output` from a broadcast-trick array to a real array, so it's writeable
    189         if (
    190             np.isnan(data)
    191             if np.isnan(fill_value)
    192             else np.equal(data, fill_value)
    193         ).all():
    194             # Unless the data we just read is all empty anyway

File ~/conda/envs/rxr3/lib/python3.11/site-packages/stackstac/rio_reader.py:405, in AutoParallelRioReader.read(self, window, **kwargs)
    403 scale, offset = reader.scale_offset
    404 if scale != 1:
--> 405     result *= scale
    406 if offset != 0:
    407     result += offset

File ~/conda/envs/rxr3/lib/python3.11/site-packages/numpy/ma/core.py:4344, in MaskedArray.__imul__(self, other)
   4342 other_data = getdata(other)
   4343 other_data = np.where(self._mask, other_data.dtype.type(1), other_data)
-> 4344 self._data.__imul__(other_data)
   4345 return self

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int16') with casting rule 'same_kind'
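
For context, the failing step in stackstac's rio_reader.py (result *= scale) comes down to numpy refusing to cast the float64 product of an in-place multiply back into an int16 buffer under the default same_kind casting rule. A minimal sketch reproducing the same behavior with plain numpy (the dtype, values, and scale factor here are illustrative, not read from the actual HLS asset):

import numpy as np

# Simulate what the reader does: the raster arrives as an int16 masked array,
# then is rescaled in place with a float scale factor.
result = np.ma.MaskedArray(
    data=np.array([[100, 200], [300, 400]], dtype="int16"),
    mask=[[False, False], [False, True]],
)
scale = 0.0001  # illustrative scale factor

try:
    result *= scale  # the float64 output cannot be cast back into the int16 buffer
except TypeError as err:  # UFuncTypeError is a subclass of TypeError
    print(err)

# Promoting to float first (an out-of-place multiply) works fine:
rescaled = result.astype("float64") * scale
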
gjoseph92 commented 1 year ago

Thanks for reporting @Berhinj. My guess is this bug was always there, but was uncovered by https://github.com/gjoseph92/stackstac/pull/183.

I think https://github.com/gjoseph92/stackstac/pull/208 should fix it. However, I don't have access to NASA Earthdata, so your example isn't reproducible for me. Could you test it out using the fix and confirm it works, or alternatively, provide a reproducer using publicly-available data?
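
In the meantime, if you need a workaround before a release with the fix, one option might be to disable stackstac's scale/offset rescaling (the step that raises the casting error) and apply the scale yourself in float space. A rough sketch, assuming the rescale keyword is available in your installed version and that the 0.0001 scale factor is appropriate for your HLS bands:

import stackstac

# `xs` is the ItemCollection from the snippet in the issue above.
# rescale=False skips the in-reader scale/offset multiply that trips the error;
# the scale is applied afterwards on the float64 dask-backed DataArray instead.
stac_img = stackstac.stack(
    xs,
    assets=["B03"],
    resolution=30,
    epsg=32615,
    rescale=False,
)
stac_img = stac_img * 0.0001  # illustrative HLS surface-reflectance scale factor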

Berhinj commented 1 year ago

Unfortunately, I haven't hit this issue with any other STAC catalog or dataset.

But I'd be happy to investigate more. I'm at the point where I rely on stackstac too much not to take a deep dive into the code.

Thanks a lot for already pointing out a potential fix; that makes things easier.

gjoseph92 commented 1 year ago

@Berhinj have you had a chance to try #202 on your data and confirm whether it fixes the issue for you?

EDIT: going to merge it now since I'm pretty sure this fixes the issue. Would still be good to get confirmation of that though!

Berhinj commented 1 year ago

Just tried it, and yes, it fixes my issue. Thanks a lot!

gjoseph92 commented 1 year ago

Great, thanks for verifying!