intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

open in xarray without dask? #138

Closed: rsignell closed this issue 8 months ago

rsignell commented 11 months ago

I have a kerchunked dataset that loads in about 20s if I use Dask, and about 1s if I don't:

import fsspec
import fsspec.implementations.reference  # make sure the submodule is loaded
import xarray as xr

combined_parquet_aws = 's3://usgs-coawst/useast-archive/combined.parq'

fs_ref = fsspec.implementations.reference.ReferenceFileSystem(
    combined_parquet_aws, remote_protocol="s3", target_protocol="s3", lazy=True)

# Method 1 (with Dask) -- takes 15-30s:
ds = xr.open_dataset(
    fs_ref.get_mapper(), engine="zarr", drop_variables=['dstart'],
    backend_kwargs={"consolidated": False}, chunks={})

# Method 2 (no Dask) -- takes 1-3s:
ds = xr.open_dataset(
    fs_ref.get_mapper(), engine="zarr", drop_variables=['dstart'],
    backend_kwargs={"consolidated": False})
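
For anyone reproducing the comparison, here is a minimal sketch of how the two methods could be timed side by side. The helper names `timed` and `open_kwargs` are hypothetical and not part of xarray or fsspec. The only difference between the two methods is the `chunks` argument: passing `chunks={}` asks xarray for dask-backed variables, while leaving it at its default of `None` gives lazily indexed variables without dask.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def open_kwargs(use_dask: bool) -> dict:
    """Keyword arguments for xr.open_dataset matching Method 1 or Method 2.

    Hypothetical helper: it only collects the arguments used in this issue.
    """
    kwargs = {
        "engine": "zarr",
        "drop_variables": ["dstart"],
        "backend_kwargs": {"consolidated": False},
    }
    if use_dask:
        # Method 1: dask-backed variables using the store's own chunking.
        kwargs["chunks"] = {}
    # Method 2: no "chunks" key, so xarray uses its default (None, no dask).
    return kwargs


# Usage, assuming xarray is imported as xr and fs_ref is defined as above:
# ds_dask, t_dask = timed(xr.open_dataset, fs_ref.get_mapper(), **open_kwargs(True))
# ds_lazy, t_lazy = timed(xr.open_dataset, fs_ref.get_mapper(), **open_kwargs(False))
# print(f"with dask: {t_dask:.1f}s, without dask: {t_lazy:.1f}s")
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, which matters for runs that last only a second or two.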

When I want to use Intake to open into Xarray, I have always used to_dask() (Method 1):

import intake
intake_catalog_url = 's3://usgs-coawst/useast_archive/coawst_useast.yml'
cat = intake.open_catalog(intake_catalog_url)
coawst = cat['COAWST_USEAST_Archive']
ds = coawst.to_dask() 

I tried .to_chunked() and it took the same amount of time as .to_dask().

How can I specify Method 2 using Intake (and get the datasets to open in a few seconds instead of 15-30)?

martindurant commented 11 months ago

Of course, there are a number of PRs in flight to get the dask open time much closer to the non-dask one, so part of the answer is "wait".

However, .read() I think gives you the regular un-dask xarray object, with the usual lazy access on the variables.

martindurant commented 11 months ago

Refs:

rsignell commented 11 months ago

> However, .read() I think gives you the regular un-dask xarray object, with the usual lazy access on the variables.

I tried .read() and I let it run for about 1 minute before killing it. Seemed like it was loading the data!

martindurant commented 11 months ago

Mm, OK. Then you can do instead:

coawst.chunks = None
coawst.discover()
ds = coawst._ds
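
The three lines above could be wrapped in a small function, sketched here under the assumption that the source behaves as described in this comment. `open_without_dask` is a hypothetical name, not part of intake or intake-xarray, and it depends on the private `_ds` attribute, which may change between releases.

```python
from typing import Any


def open_without_dask(source: Any) -> Any:
    """Apply the workaround from this thread to an intake-xarray source.

    Hypothetical helper. `source` is assumed to expose a `chunks` attribute,
    a `discover()` method, and a private `_ds` attribute, as used above.
    """
    source.chunks = None  # request no dask chunking
    source.discover()     # open the dataset and populate source._ds
    return source._ds     # the opened dataset object


# Usage, assuming `coawst` is the catalog entry from the original post:
# ds = open_without_dask(coawst)
```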

rsignell commented 11 months ago

Tried it. Also takes 20s:

intake_catalog_url = 's3://usgs-coawst/useast_archive/coawst_useast.yml'
cat = intake.open_catalog(intake_catalog_url)
coawst = cat['COAWST_USEAST_Archive']
coawst.chunks = None
coawst.discover()
ds = coawst._ds

rsignell commented 8 months ago

This now takes about 1 s, so closing!