pangeo-data / cog-best-practices

Best practices with cloud-optimized-geotiffs (COGs)
BSD 3-Clause "New" or "Revised" License
77 stars 9 forks source link

concurrent reads with dask and xarray #2

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

I've never been clear on whether or not xarray+dask computations using threads are hampered by file locks. It's complicated because GDAL-->rasterio-->xarray-->dask are all involved.

Rasterio recently updated their example of concurrent processing. This example specifically reads files from local disk rather than over a network. https://rasterio.readthedocs.io/en/latest/topics/concurrency.html

There is some good discussion here about multiple threads reading the same file concurrently here https://github.com/pangeo-data/pangeo-example-notebooks/issues/21#issuecomment-432457955

And finally the rasterio mailing list has some relevant discussion https://rasterio.groups.io/g/main/topic/72528118#468

A simple example that shows timing improvement would go a long way in helping to clarify this for people. This might require a PR to xarray to deal with locks...

scottyhq commented 3 years ago

discussed this a bit with @TomAugspurger today, and just added a notebook with some concrete comparisons https://github.com/pangeo-data/cog-best-practices/blob/main/rasterio-concurrency.ipynb

scottyhq commented 3 years ago

@TomAugspurger thanks for the PR to rioxarray, it inspired me to put together another notebook with a few more details on aligning dask chunks with data store for the case of multiband COGs (your example NAIP image in Azure) https://github.com/pangeo-data/cog-best-practices/blob/main/4-threads-vs-async.ipynb

basically I'm trying to document how chunking decisions might depend on both 1. the data format / how bytes are organized in storage and 2. the hardware you're running on (how many CPUs).

in exploring the case of running some notebooks on mybinder.org which has just 1vCPU and 2GB RAM I became interested in the ability to still achieve high throughput with asyncio instead of multithreading. this is easy to achieve if bypassing GDAL entirely with a COG-specific library like https://github.com/geospatial-jeff/aiocogeo . It's unclear to me how to integrate that sort of access pattern with xarray / rioxarray, perhaps as a sepearate 'engine' with the ongoing xarray backend refactor?

also bigger picture trying to tie this all together in way that is straight-forward for a user to run computations on a collection of cogs as an Xarray dataset with max efficiency and minimal custom code (#4)

TomAugspurger commented 3 years ago

Thanks Scott, that's very useful. I wasn't aware of the different interleaving options. This could inform what a good default chunking should be for rioxarray.open_rasterio when the user specifies chunks=True (cc @snowman2).

still achieve high throughput with asyncio instead of multithreading.

Yeah, that's worth exploring more. A separate engine seems like a decent amount of work, but feels like the only way to really do it. I imagine that it could share a lot of the implementation of rioxarray, though I haven't looked closely. I wonder if anyone has experimented with an async wrapper for GDAL / rasterio.

snowman2 commented 3 years ago

Currently rioxarray.open_rasterio is slower opening the files due to using pyproj under the hood for CRS management: https://github.com/corteva/rioxarray/discussions/234

So, the speed gains for reading in parallel will likely only be realized for very large rasters.

scottyhq commented 3 years ago

A separate engine seems like a decent amount of work, but feels like the only way to really do it.

Right. I've had a hard time wrapping my head around this to be honest. It feels like there could be a separation of concerns between the xarray/rioxarray API and the filesystem I/O such as what @martindurant demoed with xarray+fsspec+zarr async access (https://www.youtube.com/watch?v=7XDBM3pW2ls).

But that leads to the question can rasterio handle file-like objects from fsspec? and I think the answer is No based on https://github.com/mapbox/rasterio/issues/977#issuecomment-642866233

martindurant commented 3 years ago

No, I don't think it can, and neither can some other drivers such as cfgrib. To deal with those via fsspec, we would need to leverage temporary local storage of the files. That gets tricky to organise with async, where we don't know beforehand which parts of the global dataset are to be accessed (unless xarray itself were to speak async, which is asking a lot).

kylebarron commented 3 years ago

in exploring the case of running some notebooks on mybinder.org which has just 1vCPU and 2GB RAM I became interested in the ability to still achieve high throughput with asyncio instead of multithreading. this is easy to achieve if bypassing GDAL entirely with a COG-specific library like geospatial-jeff/aiocogeo . It's unclear to me how to integrate that sort of access pattern with xarray / rioxarray, perhaps as a sepearate 'engine' with the ongoing xarray backend refactor?

I was going to specifically mention aiocogeo if it hadn't already been mentioned. The upsides are that it could potentially integrate well with Dask? From a cursory read of the Dask docs, it seems to be able to integrate with async code? Downsides are that I don't believe aiocogeo handles reprojection: just loading and parsing data, so you'd need to implement that manually or still bring in pyproj/GDAL for that I think.