google / Xee

An Xarray extension for Google Earth Engine
Apache License 2.0

cache=True not working when calling xarray.open_dataset #111

Closed brookisme closed 9 months ago

brookisme commented 9 months ago

I think there is a bug in the caching, or I am misunderstanding how it should work. I am creating a dataset as shown below; note the argument cache=True. However, every time I access ds.ndvi.values it takes a handful of seconds to complete. To get around this, I am creating a new dataset directly from the values with the cache_ds function below.

Additionally, note that I tried two versions of creating the uncached dataset. In the first, I didn't filter the collection by the area beforehand, and it took much, much longer. I think this is probably expected, but I was a bit surprised since xr.open_dataset is querying over a specific geometry.

import ee
ee.Initialize()
import xarray as xr

IC=ee.ImageCollection("COPERNICUS/S2_HARMONIZED")
GEOM=ee.Geometry.Rectangle(-92.38201846776445,34.10974829658343,-92.38097240624865,34.11021909957634)
SCALE=10
EE_CRS='EPSG:3857'

IC=IC.filterDate('2021-01-01','2022-01-01').filterBounds(GEOM).map(lambda im: ee.Image(im).normalizedDifference(['B8','B4']).rename(['ndvi']))

def get_ee_xrr(ic,geom):
    xrr=xr.open_dataset(
        ic,
        engine='ee',
        crs=EE_CRS,
        scale=SCALE,
        geometry=geom,
        cache=True)
    return xrr.chunk().sortby('time')

def cache_ds(ds,bands=['ndvi']):
    attrs=ds.attrs
    coords=ds.coords
    data={}
    for b in bands:
        data[b]=xr.DataArray(
            attrs=attrs,
            coords=coords,
            data=ds[b].values)
    return xr.Dataset(data)

%time ds=get_ee_xrr(IC,GEOM)
# Wall time: 1.38 s
%time ds_cached=cache_ds(ds)
# Wall time: 863 ms
%timeit ds.ndvi.values
# 558 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ds_cached.ndvi.values
# 5.79 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

This is a huge speedup. If I only needed the values once, it would be faster not to cache, and in this toy example waiting a fraction of a second doesn't make much of a difference. But once you start using larger geometries and doing cloud masking, accessing the data multiple times becomes a big problem.

Is this expected behavior? Am I misunderstanding the cache param in xr.open_dataset? Is there another way to keep the downloaded data without explicitly recreating the dataset?

Thanks, Brookie

KMarkert commented 9 months ago

You can call .load() on the original dataset to make a one-time request to Earth Engine and pull the data into memory, so that getting the values in subsequent processing doesn't make further calls to EE. This will request the entire collection in the ds object, so make sure there is enough memory :)

I don't think the cache=True kwarg will work as expected because (if I understand xarray caching correctly) it will try to load the arrays from a datastore. Earth Engine is a virtual datastore accessed through requests, so xarray doesn't know what to cache (or maybe the requests, not the request results, are stored).

My testing:

%time ds_cached=cache_ds(ds)
# Wall time: 559 ms
%time ds.load()
# Wall time: 485 ms
%timeit ds.ndvi.values
# 8.14 µs ± 76.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit ds_cached.ndvi.values
# 9.16 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: Earth Engine does caching on its servers, so subsequent calls may be a little quicker, but .load() is the best option if you want to speed up subsequent processing.
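
To illustrate the difference without needing Earth Engine credentials, here is a minimal sketch that uses a dask-backed dataset as a stand-in for the chunked Xee dataset. The fake_fetch function and the fetch_count counter are hypothetical and only simulate a remote request; they are not part of Xee or xarray. The sketch assumes that the repeated cost in the original example comes from the lazy (chunked) array being recomputed on each .values access, which is how dask-backed xarray objects behave in general.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

fetch_count = 0  # number of simulated remote requests so far


def fake_fetch():
    """Hypothetical stand-in for a remote request; not an Xee function."""
    global fetch_count
    fetch_count += 1
    return np.arange(12, dtype="float64").reshape(3, 4)


# Lazy, dask-backed array: fake_fetch runs whenever the graph is computed.
lazy = da.from_delayed(dask.delayed(fake_fetch)(), shape=(3, 4), dtype="float64")
ds = xr.Dataset({"ndvi": (("time", "x"), lazy)})

# Use the single-threaded scheduler so the counter is updated deterministically.
with dask.config.set(scheduler="synchronous"):
    # Each .values access recomputes the lazy array, i.e. one fetch per access.
    ds.ndvi.values
    ds.ndvi.values
    calls_before_load = fetch_count

    # .load() computes once and replaces the lazy data with in-memory arrays.
    ds = ds.load()
    calls_after_load = fetch_count

    # Further accesses read from memory and trigger no new fetches.
    ds.ndvi.values
    ds.ndvi.values
    calls_after_access = fetch_count

print(calls_before_load, calls_after_load, calls_after_access)
```

In this sketch the counter goes up on every access before .load() and stops going up afterwards, which matches the timings above. As with the real dataset, .load() pulls everything into memory, so it only makes sense when the data fits.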

jdbcode commented 9 months ago

Flagging for clarification in documentation

brookisme commented 9 months ago

Thanks Kel - .load() works!