opendatacube / odc-geo

GeoBox and geometry utilities extracted from datacube-core
https://odc-geo.readthedocs.io/en/latest/
Apache License 2.0
81 stars 13 forks source link

possible bug in `save_cog_with_dask`? #185

Open jerry-kobold opened 16 hours ago

jerry-kobold commented 16 hours ago

At my company we have been using save_cog_with_dask very heavily to produce COGs from very large rasters. For certain raster/chunk combinations we have observed the function failing with an AssertionError. I did some digging and isolated the problem to the following code:

# odc/geo/cog/_mpu.py:190

def can_flush(pw: PartsWriter):
    if self.write_credits < 1:
        return False
    if self.started_write:
        return self.is_final or len(data) >= pw.min_write_sz
    if self.is_final:
        return len(data) > self.lhs_keep
    return len(data) - self.lhs_keep >= pw.min_write_sz

# odc/geo/cog/_mpu.py:207
assert can_flush(write)

It seems that for certain combinations of raster dimensions and chunking scheme, a situation can arise in which self.write_credits is 0 but the write has not completed. The following code below should reproduce the problem:

import xarray as xr
import dask.array as da
import numpy as np
from odc.geo.cog import save_cog_with_dask

ylen = 43284
xlen = 9615
bands = 2
fake_data = da.random.default_rng().integers(1, 1000, size=(bands, ylen, xlen), chunks=(1, 4000, 4000), dtype=np.int32)
fake_raster = xr.DataArray(fake_data, coords={'band': [1, 2], 'y': np.arange(ylen), 'x': np.arange(xlen)}).rio.write_crs('EPSG:4326')
output = Path('tiff-cog-test.tiff')
if output.exists():
    output.unlink()

fut = save_cog_with_dask(fake_raster, str(output), compression='deflate')
fut.compute()

The dimensions in this example were chosen because they match the dimensions of an actual raster for which this problem arose with this specific chunking. Using the time-honored method of distributed.print-based debugging, I isolated the error to the line that checks for self.write_credits < 1. Playing around with the source, I noticed that changing this line to

if self.write_credits < 1 and not self.final

makes the error go away and nothing else seems to break, but I don't know if that's the right decision or not.

Kirill888 commented 12 hours ago

Maybe assert should be

assert self.final or can_flush(write)

jerry-kobold commented 9 hours ago

From what I can tell that's basically the same thing as what I did, right? Happy to make a PR for this if it would help.

edit: would it make more sense to move the if self.final check to the top of can_flush so that condition gets evaluated first? Or is there a scenario where that could cause problems?