Closed gadomski closed 2 years ago
Is there a Python API binding equivalent to rio convert? I couldn't find one, but maybe I just didn't find it. That's the main reason I used GDAL here... If it is not available, would it be better to use rio convert directly?
Edit: Maybe I can just make a simplified version of this: https://github.com/rasterio/rasterio/blob/main/rasterio/rio/convert.py Doesn't look very difficult. I'll try this tomorrow and check whether the performance is good enough.
rio convert
doesn't work for me. For large files the process gets killed due to having not enough memory.
The adapted code here in the package also gets killed: https://github.com/stactools-packages/esa-cci-lc/commit/f67ef4c4e2e88fad8f6c86be7b01849cb558b756
Not sure what gdal_translate does dfferently, but rasterio doesn't look like an alternative right now. I also tried windowed reads, but that is very slow: https://github.com/stactools-packages/esa-cci-lc/commit/d20396be826c38d07f4fe7ff3aed4226a398116e
~Yeah, I'm playing as well, and it definitely is a big-enough dataset that it could be worth splitting down into tiles. Could be enough to just do four quadrants (NW, NE, SW, SE)?~
~IMO if we're struggling to open/read/write w/ rasterio, downstream users will as well, so we should try to solve the size issue at the convert-to-cloud-native-format step.~
EDIT: I might have jumped the gun here -- turns out COG conversion w/ rasterio wasn't too bad on my system (~8 minutes):
$ time python scripts/convert.py ~/Downloads/ESACCI-LC-L4-LCCS-Map-300m-P1Y-2015-v2.0.7cds.nc
Squeezing values
Writing memory file
Writing COG
python scripts/convert.py 434.56s user 82.05s system 93% cpu 9:11.41 total
$ du -h cog.tif
656M cog.tif
Script here: https://github.com/stactools-packages/esa-cci-lc/blob/f4da9aff86556252d6a39586de90a7bf3b9a3a36/scripts/convert.py.
BUT it still may make sense to chop that file into quarters, especially since we'll also have a float COG which will be a lot bigger.
Thanks for investigating, @gadomski.
If I understand correctly, this is 8 minutes for the lccs_class COG only, right? I had a runtime of about 4-5 minutes for them, but of course with the GDAL wrapper and we have different machines. The size is roughly the same. I really tried to optimize runtime, but if slower is fine, we can surely get rid of the wrapper and go with slower processing times, too.
Why will we have a float COG though? We only have uint8 and uint16 (observation_counts) in the source. Also I don't think the uint16 file would fit into (my) memory. Have you tested the observation_counts variable?
If not, I can give your script a try on Monday and then report back how it works for all variables.
If I understand correctly, this is 8 minutes for the lccs_class COG only, right?
Yup.
Why will we have a float COG though?
Ah, I must have mis-remembered an earlier conversation about a float quality band.
Have you tested the observation_counts variable?
I have not, but if we're even close to hitting memory bounds we probably should tile the data up.
Well, we are not hitting them with the current code on main. Main is running without issues (for me) except that it's using the GDAL bindings. But I'd assume we run into memory issues as the proposed code uses MemoryFile(), but all my tests with loading the whole observation_counts data was not working due to the lack of memory (available were ~10GB).
Right now we only need to tile for obversation_counts, but that's rarely used quality data so I'd say the way it is works good enough without tiling.
For observation_count, this gets already killed during the value squeezing due to the amount of memory required.
I've added gdal now correctly as a dependency though, not specifying it was an oversight on my side.
So to conclude this for now: GDAL makes a good job and I can't make it work efficiently with rasterio in a reasonable amount of time. The dependency issue was likely caused by forgetting a dependency in the setup.cfg which has been added meanwhile, so I assume this works now. So for now I'm not planning any changes. Please let me know if you don't agree / have a better idea.
Yup, this sounds find to me. Thanks for closing this loop.
I don't think it's explicitly documented anywhere, but we generally try to avoid GDAL's Python bindings, as they often cause environment setup issues; we prefer
rasterio
and/orxarray
to read raster data. I see GDAL is used for the COG conversion: https://github.com/stactools-packages/esa-cci-lc/blob/f3a4afc3b4a08c1284350d33403c4a30a11d3803/src/stactools/esa_cci_lc/cog.py#L105Could this be done with rasterio? Or was rasterio causing the performance issues we discussed earlier?
FWIW, I tried a simple
pip install -e . && pip install -r requirements-dev.txt && pytest
and got GDAL Python binding issues, probably becausegdal
wasn't specified as a direct dependency insetup.cfg
. However, if possible, it'd be better to use rasterio/xarray IMO.