microsoft / PlanetaryComputerExamples

Examples of using the Planetary Computer
MIT License

How can I get a one-pixel time series in the most efficient way? #276

Closed Drfengze closed 1 year ago

Drfengze commented 1 year ago

Like it is possible on Google Earth Engine. It is very hard to do on your service: I have to open all the data I need and then subset it, which is a really unfriendly process. I would appreciate your help in improving this part.

TomAugspurger commented 1 year ago

What have you tried so far, and what's causing issues? Happy to help more if you can provide more details, like what data set you're using.

At a high level, you'll want to

  1. Find the STAC items intersecting with your area (pixel?) of interest
  2. Load the data using a library like stackstac or odc-stac

Depending on the size of the timeseries, you might be best off using our stac-geoparquet snapshots to load the items rather than the STAC API. Either should get you to the same place though.
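
For a swath-like collection, that workflow looks roughly like the sketch below (a minimal example assuming stackstac is installed; the collection, band, date range, and point are placeholders, not taken from this issue):

import planetary_computer
import pystac_client
import stackstac

lon, lat = 127.0, 35.0  # hypothetical point of interest

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
items = catalog.search(
    collections=["sentinel-2-l2a"],  # placeholder collection
    intersects={"type": "Point", "coordinates": [lon, lat]},
    datetime="2022-01-01/2022-12-31",
).item_collection()

# Build a lazy stack in geographic coordinates around the point, then
# select the nearest pixel to get a time series.
stack = stackstac.stack(
    items,
    assets=["B04"],       # placeholder band
    epsg=4326,
    resolution=0.0001,    # degrees, roughly 10 m at the equator
    bounds_latlon=(lon - 0.01, lat - 0.01, lon + 0.01, lat + 0.01),
)
ts = stack.sel(x=lon, y=lat, method="nearest").compute()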

Drfengze commented 1 year ago

Thanks for your fast reply! I am working on CMIP6. Maybe I cannot use STAC to subset this dataset, because it is not swath-like; each image represents the whole world, so I have to access the whole asset to get subsets. Here is my code:


import cftime
import fsspec
import logging
import planetary_computer
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
collection = catalog.get_collection("nasa-nex-gddp-cmip6")

%%time
def to_360day_daily(da):
    '''Takes a DataArray. Change the calendar to 360_day and precision to daily.'''
    val = da.copy()
    time1 = da.time.copy()
    for itime in range(val.sizes['time']):
        bb = val.time.values[itime].timetuple()
        time1.values[itime] = cftime.Datetime360Day(bb[0], bb[1], bb[2])
    # Attach the converted timestamps as a coordinate before swapping dims;
    # otherwise the 360_day values computed above would never be used.
    val = (
        val.assign_coords(time360=('time', time1.values))
        .swap_dims({'time': 'time360'})
        .drop_vars('time')
    )
    return val

# import dask
# from dask.distributed import Client
# client = Client(threads_per_worker=1, memory_limit=0, silence_logs=logging.ERROR)

var_ls = collection.summaries.get_list("cmip6:variable")
scenario_ls = ['ssp245', 'ssp585']
model_ls = collection.summaries.get_list("cmip6:model")
var_data = []

for scenario in scenario_ls:
    for var in var_ls:
        model_data = []
        for model in model_ls:
            try:
                search = catalog.search(
                    collections=["nasa-nex-gddp-cmip6"],
                    query={"cmip6:model": {"eq": model}, "cmip6:scenario": {"eq": scenario}},
                )
                items = search.get_all_items()
                if len(items) == 0:
                    continue
                # Open every NetCDF asset for this scenario/variable/model combination.
                ts = xr.open_mfdataset(
                    [fsspec.open(item.assets[var].href).open() for item in items],
                    decode_times=False,
                    parallel=True,
                )
                decoded_ts = xr.decode_cf(ts)
                model_ts = decoded_ts[var].sel(lat=slice(33, 38), lon=slice(126, 130))
                model_ts = to_360day_daily(model_ts)
                model_data.append(model_ts)
            except KeyError:
                print(f"Skipping variable '{var}' in model '{model}'")
                continue
        if model_data:
            model_average = xr.concat(model_data, dim='model').mean(dim='model')
            var_data.append((scenario, var, model_average))

# Convert the list to an xarray Dataset, keyed by "<scenario>_<variable>".
var_data_ds = xr.Dataset(
    {
        f"{scenario}_{var}": model_average
        for scenario, var, model_average in var_data
    }
)

Drfengze commented 1 year ago

I saw this in the official code for CMIP

Using a Reference File
Note: the approach described here is experimental and may change without warning.

In the previous section, we created a single xarray.Dataset from many NetCDF files (either many variables, or a timeseries for a single variable). Reading the metadata of a NetCDF / HDF5 file over the network is somewhat slow, making the open_mfdataset operation take about 20-30 seconds, just to read the metadata.

So in addition to the NetCDF files, we provide a reference file which stores the positions of each variable in each NetCDF file's binary stream. This reference file can be opened with fsspec and zarr and used normally with xarray.

But unfortunately, it only provides a very small part of the CMIP6 dataset, so I still need to use xr.open_mfdataset to open every file.
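
For context, the reference-file pattern described in that documentation excerpt looks roughly like this (a sketch; the reference-file URL and the point are placeholders):

import fsspec
import xarray as xr

references_url = "https://example.com/references.json"  # hypothetical reference file

# The reference filesystem exposes the byte ranges recorded in the reference
# file as a Zarr store, so xarray can open it without reading full NetCDF headers.
fs = fsspec.filesystem("reference", fo=references_url, remote_protocol="https")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False, chunks={})

point_ts = ds.sel(lat=35.0, lon=127.0, method="nearest")  # one-pixel time series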

TomAugspurger commented 1 year ago

Mmm, this might be challenging for this dataset. First, it's provided in HDF5 files, which are slow to open over the network. Second, it's chunked in time (I think with size 1), which is exactly the opposite of the chunking you want. Over in the pangeo project, we developed rechunker for this type of workflow.
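
A minimal sketch of what rechunker does for this access pattern (the array shape, chunk sizes, and store paths are made up for illustration):

import dask.array as da
from rechunker import rechunk

# Hypothetical (time, lat, lon) array stored with time-size-1 chunks,
# which makes reading one pixel's full time series very slow.
source = da.zeros((10000, 600, 1440), chunks=(1, 600, 1440), dtype="f4")

# Rewrite it with long-in-time, small-in-space chunks so that a point
# time series touches only a handful of chunks.
plan = rechunk(
    source,
    target_chunks=(10000, 10, 10),
    max_mem="2GB",
    target_store="rechunked.zarr",
    temp_store="rechunk-temp.zarr",
)
plan.execute()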

I'll just note that we also have the Climate Impact Lab's CMIP6 downscaling in Zarr format. I believe that's chunked in both time and space. That might speed things up a bit, if the data is compatible with your use case.
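
Opening one of those Zarr assets and slicing a point time series would look roughly like this (the collection id, query fields, asset key, and point are assumptions for illustration, not verified in this thread):

import planetary_computer
import pystac_client
import xarray as xr

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
search = catalog.search(
    collections=["cil-gdpcir-cc-by"],  # assumed collection id
    query={
        "cmip6:source_id": {"eq": "GFDL-ESM4"},   # assumed model
        "cmip6:experiment_id": {"eq": "ssp245"},   # assumed scenario
    },
)
item = next(search.items())

# Each asset is a Zarr store; the asset metadata carries the kwargs to open it.
asset = item.assets["tasmax"]  # assumed asset key
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
point_ts = ds["tasmax"].sel(lat=35.0, lon=127.0, method="nearest")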

Drfengze commented 1 year ago

@TomAugspurger Thank you, handsome answer, you saved my day~~ Zarr is the best

Drfengze commented 1 year ago

Oh no, I need some variables from the original dataset, and I find they are not in the downscaled one. Also, could you please supply monthly data? The daily data is too huge. I will try rechunker to see whether it works.

Drfengze commented 1 year ago

I gave up

TomAugspurger commented 1 year ago

Oh no, I need some variables from the original dataset, and I find they are not in the downscaled one.

I might be wrong, but I think these are both downscaling products, just from different groups.

Also, could you please supply monthly data? The daily data is too huge.

I think the upstream data provider (Climate Impact Lab) only produced daily data in this case.