stactools-packages / esa-cci-lc

stactools package for ESA's Climate Change Initiative (CCI) Land Cover (LC) product
Other
6 stars 0 forks source link

Don't use GDAL's Python bindings #3

Closed gadomski closed 2 years ago

gadomski commented 2 years ago

I don't think it's explicitly documented anywhere, but we generally try to avoid GDAL's Python bindings, as they often cause environment setup issues; we prefer rasterio and/or xarray to read raster data. I see GDAL is used for the COG conversion: https://github.com/stactools-packages/esa-cci-lc/blob/f3a4afc3b4a08c1284350d33403c4a30a11d3803/src/stactools/esa_cci_lc/cog.py#L105

Could this be done with rasterio? Or was rasterio causing the performance issues we discussed earlier?

FWIW, I tried a simple pip install -e . && pip install -r requirements-dev.txt && pytest and got GDAL Python binding issues, probably because gdal wasn't specified as a direct dependency in setup.cfg. However, if possible, it'd be better to use rasterio/xarray IMO.

m-mohr commented 2 years ago

Is there a Python API binding equivalent to rio convert? I couldn't find one, but maybe I just didn't find it. That's the main reason I used GDAL here... If it is not available, would it be better to use rio convert directly?

Edit: Maybe I can just make a simplified version of this: https://github.com/rasterio/rasterio/blob/main/rasterio/rio/convert.py Doesn't look very difficult. I'll try this tomorrow and check whether the performance is good enough.

m-mohr commented 2 years ago

rio convert doesn't work for me. For large files the process gets killed due to having not enough memory. The adapted code here in the package also gets killed: https://github.com/stactools-packages/esa-cci-lc/commit/f67ef4c4e2e88fad8f6c86be7b01849cb558b756

Not sure what gdal_translate does dfferently, but rasterio doesn't look like an alternative right now. I also tried windowed reads, but that is very slow: https://github.com/stactools-packages/esa-cci-lc/commit/d20396be826c38d07f4fe7ff3aed4226a398116e

gadomski commented 2 years ago

~Yeah, I'm playing as well, and it definitely is a big-enough dataset that it could be worth splitting down into tiles. Could be enough to just do four quadrants (NW, NE, SW, SE)?~

~IMO if we're struggling to open/read/write w/ rasterio, downstream users will as well, so we should try to solve the size issue at the convert-to-cloud-native-format step.~

EDIT: I might have jumped the gun here -- turns out COG conversion w/ rasterio wasn't too bad on my system (~8 minutes):

$ time python scripts/convert.py ~/Downloads/ESACCI-LC-L4-LCCS-Map-300m-P1Y-2015-v2.0.7cds.nc
Squeezing values
Writing memory file
Writing COG
python scripts/convert.py   434.56s user 82.05s system 93% cpu 9:11.41 total
$ du -h cog.tif 
656M    cog.tif

Script here: https://github.com/stactools-packages/esa-cci-lc/blob/f4da9aff86556252d6a39586de90a7bf3b9a3a36/scripts/convert.py.

BUT it still may make sense to chop that file into quarters, especially since we'll also have a float COG which will be a lot bigger.

m-mohr commented 2 years ago

Thanks for investigating, @gadomski.

If I understand correctly, this is 8 minutes for the lccs_class COG only, right? I had a runtime of about 4-5 minutes for them, but of course with the GDAL wrapper and we have different machines. The size is roughly the same. I really tried to optimize runtime, but if slower is fine, we can surely get rid of the wrapper and go with slower processing times, too.

Why will we have a float COG though? We only have uint8 and uint16 (observation_counts) in the source. Also I don't think the uint16 file would fit into (my) memory. Have you tested the observation_counts variable?

If not, I can give your script a try on Monday and then report back how it works for all variables.

gadomski commented 2 years ago

If I understand correctly, this is 8 minutes for the lccs_class COG only, right?

Yup.

Why will we have a float COG though?

Ah, I must have mis-remembered an earlier conversation about a float quality band.

Have you tested the observation_counts variable?

I have not, but if we're even close to hitting memory bounds we probably should tile the data up.

m-mohr commented 2 years ago

Well, we are not hitting them with the current code on main. Main is running without issues (for me) except that it's using the GDAL bindings. But I'd assume we run into memory issues as the proposed code uses MemoryFile(), but all my tests with loading the whole observation_counts data was not working due to the lack of memory (available were ~10GB).

Right now we only need to tile for obversation_counts, but that's rarely used quality data so I'd say the way it is works good enough without tiling.

m-mohr commented 2 years ago

For observation_count, this gets already killed during the value squeezing due to the amount of memory required.

I've added gdal now correctly as a dependency though, not specifying it was an oversight on my side.

m-mohr commented 2 years ago

So to conclude this for now: GDAL makes a good job and I can't make it work efficiently with rasterio in a reasonable amount of time. The dependency issue was likely caused by forgetting a dependency in the setup.cfg which has been added meanwhile, so I assume this works now. So for now I'm not planning any changes. Please let me know if you don't agree / have a better idea.

gadomski commented 2 years ago

Yup, this sounds find to me. Thanks for closing this loop.