pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Optimize Data Reuse in StoreToZarr and Similar Processes #730

Open moradology opened 6 months ago

moradology commented 6 months ago

Description

Currently, rechunking data and ingesting it into a Zarr store encounters significant performance bottlenecks, primarily due to inefficient read strategies: substantial effort and many network requests can be wasted rereading byte ranges from the source datasets. This inefficiency is particularly pronounced in workflows that transfer large datasets over the network or read from slow storage systems.

The Problem

When reading data from source files (e.g., NetCDF, Zarr) to write into an output Zarr store, the current implementation does not effectively reuse data that has already been read into memory. Instead, overlapping byte ranges might be read multiple times from the same or different source files, leading to unnecessary I/O operations and increased execution time.

This issue is compounded when the chunking scheme of the output store does not align with that of the input files, necessitating partial reads of larger chunks and leading to both inefficient data transfer and increased memory usage. For example, if a source file stores chunks of 10 time steps but the target store uses chunks of 15, every other target chunk straddles two source chunks, and the shared source chunk may be fetched once per target chunk that touches it.

Ideas

  1. Intelligent Caching: Implement a caching mechanism that temporarily stores read chunks in memory. Subsequent write operations requiring the same byte ranges could use this cache, reducing the need for additional reads from the source. We already do byte caching via fsspec constructs; we might be able to use the DoFn lifecycle to ensure that the cache is shared across units of work within a worker.

  2. Graph-based Data Dependency Analysis: Construct a graph that models the dependencies between read and write operations. Nodes represent chunks in both the input and output datasets, while edges denote the data flow from read chunks to write chunks. Optimizing this graph could help schedule reads and writes in a way that maximizes data reuse (a minimal sketch follows this list). There is surely prior art on this - anyone familiar?

  3. Heuristic-based Read Scheduling: Develop read-scheduling heuristics that prioritize data already in memory as a further optimization, so that LRU or similar cache-eviction policies remain effective.
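As a starting point on idea 2, here is a minimal sketch (the function name and the single-dimension simplification are mine, not anything in pangeo-forge-recipes) that enumerates which read chunks each write chunk depends on along one axis. Any read chunk that appears under more than one write chunk is a byte range worth caching:

```python
import math

def chunk_dependencies(n, read_chunk, write_chunk):
    """Map each write chunk index to the read chunk indices it overlaps
    along a single axis of length n."""
    deps = {}
    for w in range(math.ceil(n / write_chunk)):
        start = w * write_chunk
        stop = min(start + write_chunk, n)
        first = start // read_chunk
        last = (stop - 1) // read_chunk
        deps[w] = list(range(first, last + 1))
    return deps

# 30 elements, source chunks of 10, target chunks of 15:
# read chunk 1 is needed by both write chunks 0 and 1.
print(chunk_dependencies(30, 10, 15))
# {0: [0, 1], 1: [1, 2]}
```

Extending this to N dimensions should be a per-axis computation combined with a Cartesian product over axes, since chunk grids are hyperrectangular.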

Illustration of caching strategy

Consider a scenario where two adjacent write chunks (W1) and (W2) in the output Zarr store depend on overlapping ranges of a read chunk [R1] from a source file. Currently, the overlapping portion of [R1] might be read twice, once for each write operation. An optimized approach would read [R1] once, cache it, and then use the cached data for both (W1) and (W2), effectively halving the read operations for this segment.

Read: [R1] -----> [Cache]
                     |
                    / \
Write:           (W1) (W2)

We can likely use the DoFn setup step, which can initialize shared resources (even across bundles of work!) within a worker for a given stage of pipeline execution.
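A minimal sketch of that lifecycle hook, assuming Beam's Python SDK (the DoFn name, the element layout, and the LRU policy are illustrative placeholders, not existing pangeo-forge-recipes code):

```python
from collections import OrderedDict

import apache_beam as beam
import fsspec


class CachingRangeReader(beam.DoFn):
    """Read byte ranges through a small LRU cache that survives bundles."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries

    def setup(self):
        # setup() runs once per DoFn instance on a worker, and the instance
        # is reused across bundles, so the cache outlives a single bundle.
        self._cache = OrderedDict()  # (url, offset, length) -> bytes

    def _read_range(self, url, offset, length):
        key = (url, offset, length)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        with fsspec.open(url, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        self._cache[key] = data
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return data

    def process(self, element):
        url, offset, length = element  # hypothetical element layout
        yield self._read_range(url, offset, length)
```

Because the instance persists across bundles, ranges needed by several write chunks on the same worker are fetched once, and eviction keeps memory bounded.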

moradology commented 6 months ago

This looks relevant. https://rechunker.readthedocs.io/en/latest/algorithm.html
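For reference, rechunker sidesteps repeated reads by staging data through an intermediate store sized so that every read and write stays within a memory budget. A minimal usage sketch against its documented API (store paths and chunk sizes are illustrative, matching the 10-vs-15 example above):

```python
import zarr
from rechunker import rechunk

# Illustrative source: an array chunked (10, 720, 1440) on the time axis
source = zarr.open("source.zarr", mode="r")

plan = rechunk(
    source,
    target_chunks=(15, 720, 1440),  # desired output chunking
    max_mem="1GB",                  # per-worker memory budget
    target_store="target.zarr",
    temp_store="temp.zarr",         # intermediate store bridging misaligned chunks
)
plan.execute()
```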