pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.57k stars 1.07k forks source link

inconsistent mean computation #5785

Closed theobarnhart-USGS closed 2 years ago

theobarnhart-USGS commented 3 years ago

What happened: I was computing an objective function between a remotely sensed snow covered area dataset and a simulated snow covered area dataset and every time I ran the code I got a different answer. I narrowed the issue down to when the monthly mean is taken of an image stack (geotiffs here) and then the mean is taken of the monthly values.

Maybe there is something about using geotiffs in this way that I am missing? The issue also occurs when I use the NaNs in the geotiffs to insert NaNs in the proper places in the simulation output. When I don't "cascade" the NaNs, the issue goes away in the simulated data but it always persists in the geotiffs.

What you expected to happen: I was expecting the mean of means output to be the same ever iteration as the same image stack was used. Minimal Complete Verifiable Example: I can't share the image stack at this time, but here is a shortened version of the code.

test=[]
for i in tqdm.notebook.tqdm(range(2)):
    # load fSCA and assign dims
    ls = xr.open_mfdataset(files, parallel=True, chunks=dict(time=1,band=1, x=2500, y=2500),
                           preprocess=add_time_coord)
    ls['time'] = dates
    ls = ls.rename_vars(dict(band_data='fsca'))

    # threshold fSCA to SCA and cascade NaNs
    ls['sca'] = xr.where(ls.fsca >= 150, 1, 0) # compute sca from fsca

    test.append(ls.sca.groupby("time.month").mean().compute()) # monthly images

np.sqrt(np.mean(np.square((test[0].mean(('month'))-test[1].mean(('month'))).values))) 

>>> 0.004654084555997707

I expected the value here to be zero. Anything else we need to know?: The geotiffs have a no data value of -999 which is decoded to np.NaN

Environment:

Output of xr.show_versions() ------------------ commit: None python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 4.4.0-18362-Microsoft machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.18.2 pandas: 1.2.4 numpy: 1.20.3 scipy: 1.6.3 netCDF4: 1.5.6 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.2.3 cfgrib: None iris: None bottleneck: None dask: 2021.05.0 distributed: 2021.05.0 matplotlib: 3.4.2 cartopy: 0.19.0.post1 seaborn: 0.11.1 numbagg: None pint: None setuptools: 49.6.0.post20210108 pip: 21.1.2 conda: None pytest: None IPython: 7.23.1 sphinx: None
theobarnhart-USGS commented 3 years ago

I converted the geotiffs to netCDF and the problem seems to go away. Perhaps this has to do with how NaNs are decoded from geotiffs? I also tried this workflow with the rioxarray engine 'rasterio' and had the same issue. I'll do some more testing and report back.

andersy005 commented 2 years ago

@theobarnhart-USGS, did you find a solution and/or can we close this issue?

theobarnhart-USGS commented 2 years ago

Hi @andersy005, the only thing that I found was that I needed to convert the geotiffs to netcdf to get the computation to be consistent. I didn't trace it further so its probably best to close this.