opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Feature request: make netcdf4 dependency optional #1241

Closed Kirill888 closed 1 year ago

Kirill888 commented 2 years ago

Right now the netcdf4 library is a non-optional dependency of datacube, but its use within the library is actually fairly limited. With the deprecation of the ingestion step and the move towards cloud, netcdf data sources are becoming less relevant in many deployments of the datacube. netcdf4 is a rather heavy dependency, both in terms of disk space used and complexity of installation. Having a lean dependency set is particularly beneficial for things like cloud deployments (the AWS Lambda layer limit is 250 MB, for example).

I believe that in the case of netcdf4, making it optional is relatively low cost and the benefit is significant. Ideally, of course, this would require automated testing across different Python environments, which can be tricky and complex to set up, but we can start with just a manual test.
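One way to approximate such a manual test without building a separate environment is to make the import fail on purpose. The sketch below is not part of datacube; `hide_module` is a hypothetical helper name. It relies on standard Python behaviour: when the entry for a module in `sys.modules` is `None`, importing that module raises `ModuleNotFoundError`.

```python
import contextlib
import importlib
import sys


@contextlib.contextmanager
def hide_module(name):
    """Make `import name` fail inside the block, then restore the previous state.

    Hypothetical helper for manual testing, not a datacube API.
    """
    # Remember (and temporarily drop) the module and any of its submodules.
    saved = {
        key: mod
        for key, mod in sys.modules.items()
        if key == name or key.startswith(name + ".")
    }
    for key in saved:
        del sys.modules[key]
    # A None entry in sys.modules makes the import system raise ModuleNotFoundError.
    sys.modules[name] = None
    try:
        yield
    finally:
        sys.modules.pop(name, None)
        sys.modules.update(saved)


if __name__ == "__main__":
    with hide_module("netCDF4"):
        try:
            importlib.import_module("netCDF4")
        except ImportError:
            print("netCDF4 is hidden; import the package under test here")
```

This only affects imports that happen inside the `with` block. Code that imported the hidden module earlier keeps its reference, so the package under test has to be imported for the first time inside the block. A clean environment without the library remains the more reliable check.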

woodcockr commented 2 years ago

I'd prefer to see this feature request framed as "Make Zarr more prominent given the move to Cloud data sources", or something more forward-looking, rather than straight removal. There are good reasons for storage regimes other than COGs, particularly with more hyperspectral and higher temporal frequency satellites.

It may be prudent to consider this not from the perspective of a less commonly used package, but from that of the feature set it is indicative of, and to canvass the user community for a more evidence-based sense of the requirements and direction (the now completed Enhancement proposal that added Zarr support is a good example). It would also be good to check netcdf cloud support, as there has been a great deal of recent development which is changing that situation.

The reference to ingestion deprecation requires deeper consideration, I think. Ingestion is deprecated from the perspective of the Open Data Cube and no longer truly supported. In saying this, I am not advocating for ODC ingestion features, nor discouraging the separation of concerns underway that allows odc-stac to operate (indeed, this refactoring benefits ODC as it stands as well as the STAC activity). I am noting that the need to copy and curate data is not eliminated; it is still required of anyone managing collections of data. That Open Data Cube chooses (at the moment) not to support this doesn't change this reality, any more than having STAC APIs eliminates the need for Open Data Cube database support for anyone curating data that is not available via a STAC API or does not meet governance requirements (e.g. sovereignty) for use.

I'm also not convinced that saving disk space at the 250 MB level by removing netcdf4 addresses a real issue for ODC. ML for EO is increasingly common, and the two commonly used ML packages weigh in at 3 GB and 5 GB respectively; netcdf4 removal won't help with that type of deployment issue.

Perhaps the key here is to make netcdf4 optional and ensure key elements of the feature set it represents are still carried forward. Otherwise Open Data Cube may well become too narrowly focussed in feature set, storage regime and structures when there is more to consider.

Kirill888 commented 2 years ago

@woodcockr the proposal is not to remove netcdf support altogether, but rather to ensure that datacube can be safely used in an environment where netcdf libraries are not installed. This is similar to what we currently have for s3 support: if you didn't install datacube with the [s3] feature flag and there is no boto3 in your environment, datacube will still operate as normal so long as you don't attempt to index any datasets from s3. The same should be possible for netcdf: so long as you do not attempt to index or read netcdf files, you should not need to install netcdf libraries. In fact, I'm not even sure we use the netcdf4 library directly when reading netcdf sources, rather than only via rasterio/GDAL. The netcdf library is used when indexing netcdf files, to extract metadata from the file, and during ingestion, to write out netcdf files.
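The behaviour described above, where the library works normally until a feature needing the missing dependency is actually used, can be sketched roughly as follows. This is a generic illustration of the optional-import pattern and not datacube's actual implementation. The names `optional_import` and `_MissingDependency`, and the `netcdf` extra name, are assumptions made for the example.

```python
import importlib


class _MissingDependency:
    """Placeholder that raises a helpful error on first real use (hypothetical)."""

    def __init__(self, module_name, extra):
        self._module_name = module_name
        self._extra = extra

    def __getattr__(self, attr):
        # Only reached for attributes not set in __init__, i.e. real usage
        # such as `netCDF4.Dataset`.
        raise ImportError(
            f"{self._module_name} is required for this feature but is not "
            f"installed; try installing with the [{self._extra}] extra "
            f"(extra name assumed for illustration)"
        )


def optional_import(module_name, extra):
    """Return the module if it is installed, otherwise a raising placeholder."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return _MissingDependency(module_name, extra)


# Importing never fails here; the error is deferred until netcdf
# functionality is actually touched.
netCDF4 = optional_import("netCDF4", "netcdf")
```

With this shape, code paths that never index or write netcdf files never touch the placeholder, so they run the same whether or not the library is present. Only a call such as `netCDF4.Dataset(...)` would surface the error in an environment without it.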

woodcockr commented 2 years ago

That makes sense, I misunderstood so thanks for the clarification. In the sense described I'm happy for netcdf4 to be made optional.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
