opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0
500 stars 175 forks source link

Remove geo-registration fallback when loading rasters with missing spatial metadata #673

Open Kirill888 opened 5 years ago

Kirill888 commented 5 years ago

Introduction

When loading data from a file with missing geo-registration metadata datacube will not fail, instead it will silently substitute missing metadata with metadata recorded in the Dataset object.

Problems

  1. It's a really niche feature with very little use, yet the costs are high
    • Significantly increases unit test burden
    • Not often used (i.e. tested) in production
    • Complicates IO driver interface (need to supply fallback CRS/transform)
  2. Spatial metadata is recorded per Dataset, not per band within a dataset, assuming all bands share the same geo-registration which is not the case for Landsat for example.
  3. We don't even store enough metadata per Dataset to support the fallback: transform is not recorded, we compute it from Dataset bounding box recorded in DB and shape of the image file being loaded (shape is not recorded in the DB, yeah I know!!)
  4. Performance costs, particularly in distributed environment
    • Rather than just sending filename + band index we need to send entire Dataset object to IO worker to support a case that always never happens.

Proposal

Get rid of this "feature" and instead fail when encountering files that do not have geo-spatial information available.

omad commented 5 years ago

Thanks for the write up. I think we can definitely remove the code for doing this.

It sounds like a huge maintenance burden for minimal or no benefit.

This "feature" was originally added to support some non-gdal-compliant datasets, but I can't even remember which datasets, and who it was for.

We've taken the approach elsewhere that data added to ODC should be readable by GDAL and regularly gridded, and this decision would be in line with that.

Kirill888 commented 5 years ago

With the addition of custom IO drivers this can be better implemented as a driver if/when is needed. Although "fix your data" should be preferred, it's really only useful when data is readable but not writeable, and you can not afford to duplicate it.

woodcockr commented 5 years ago

Not often used (i.e. tested) in production

Who did you check with? How is use measured?

It's kind of one of those little used "saved my bacon" looking features related to cleaning up and munging real world data into an analysable form. It may not fit the "well curated" world of production EO information curation but in real world use by EO scientists using multiple sources of data I can imagine it being used for good reason. It was clearly added by someone for a reason.

Just a little concerned this decision is based on a narrow view. Though I agree the maintenance cost may not be worth it.

4 concerns me a little as well, not the dot point, but the "filename + band index". What if the storage units are no longer files? Probs not relevant to the current issue but ODC does seem to be narrowing its conceptual design to a file management oriented view and I wonder if that is what we want as a Data Cube?

Kirill888 commented 5 years ago

@woodcockr I have asked @simonaoliver about historical context for this. It seems this have been added to support some netcdf files from BoM, and also maybe Modis. My understanding is that these files have geospatial metadata in them, but were not, at the time, readable by GDAL, this might or might have not changed since. If you have examples of production data that relies on this feature, I'm all ears, hence this issue. Better still, would love a PR with unit tests and example data.

Use is not measured, but majority of products on NCI right now are ingested products that have valid CRS and transform embedded in them. Same goes for data from USGS and Sentinel.

I do think that one is better off fixing data to be readable by GDAL, instead of relying on this "hack", there are many options:

  1. Re-encode to a better supported format
  2. Add .aux file with geospatial data understood by GDAL
  3. Create VRT with fixed spatial data

In case of Netcdf: write a driver that uses Netcdf lib to parse/load data and don't rely on GDAL if it has poor support for your files. And it is a "hack", because spatial data that is stored in the Dataset yaml is only an approximation of what is really needed to implement fallback, footprint+CRS is not enough, since this assumes that Dataset footprint == Every band footprint.

Regarding your concern with respect to "filename + band index": new IO driver design has provisions for propagating driver specific information from DB to the loading code, we are just being more precise about information flows. Current approach of "just pass around this giant Dataset object everywhere because it's convenient", has serious limitations. We already face RAM issues with some of our product generation on NCI at the metadata processing stage, now scale that up some more for whole of Africa cube.

This whole feature is about behaviour of a default rasterio based driver, other IO drivers can implement whatever Geospatial fallback mechanisms they desire. As it stands everybody is paying performance costs for a feature that only few use, and there are better and more correct ways to address those use cases anyway.

woodcockr commented 5 years ago

I don't have examples for "production data", my comment was two fold:

Whilst fixing source data to be properly encoded is preferred most data scientists I work with also avail themselves of logical hacks to get the job done. The ODC is a data scientist tool, not just a production data management system. I don't recall the reason this code was added, was simply highlighting a different perspective to that of the production case GA is most often faced with and that there are others in the ODC community potentially impacted by implementation and design decisions.

re: "filename+band index" - Yeah, current approach is awkward. It was the assumption there is a file that concerned me, not everything is GDAL in ODC. A more useful "Storage Unit" specification with driver specific content sounds like a plan.

Kirill888 commented 5 years ago

@woodcockr I should have been more precise that this is about default rasterio/GDAL-based IO driver, not ODC in general. In the context of rasterio based driver everything is a "file", or more precisely a "url".

woodcockr commented 5 years ago

@Kirill888 All good, we understand one another and ODC will be better for it.

FYI, I like the way you think about things, and your code is great, and you work fast. If you see me jumping on your comments, its not jumping on you, just trying to get across it before you've completely implemented it! ;-)

alexgleith commented 5 years ago

To throw my $0.02 in, I think that if we can simplify things by removing a feature where we don't have any examples of it being used currently then we should.

I also agree that failing early on the case where a file/thing doesn't have a required attribute is fine. We should be enforcing rules around data not providing workarounds.

I support this proposal.

woodcockr commented 5 years ago

I'm only hesitant because of the statements of not having any evidence of it being used. There is no way to know if this impacts anyone. A similar argument was used previously on a different topic resulting in breaking changes both in Australia and Overseas. That doesn't build a strong sense of community trust.

Can we use a Futures warning that the functionality will be dropped in X months time so folks get a heads up and can respond accordingly to either have it incorporated or let it drop and do their own work around (aka fix their data)?

Dropping features that already exist cold turkey just doesn't strike me as a wise approach in the absence of actual usage information.

Kirill888 commented 5 years ago

At the very least this a terrible default option. And it's a "half feature", so I can provide fallback for missing data, great!, but I can not override existing but "broken" data, why not?

Regarding real-world use: I have not seen a single prepare script that is geared towards this use case. If anyone is using it, it is probably GA, and I can't seem to find which products on NCI relying on this, yet alone who uses these products.

@woodcockr you seem to discount the cost of existing solutions. Just because a feature already exists (in undocumented state by the way in this specific case) does not mean it is cost free. It costs in complexity, in documentation effort, in testing effort, in maintenance effort, it hurts performance, it makes refactoring work much harder. I'm not proposing to delete things cause "I hate all old code" there are real reasons: I had to spend significantly more effort on implementing new concurrent version of rasterio based driver exactly because of legacy features like this one, since I need feature-parity for a new solution to be successful.

alexgleith commented 5 years ago

I agree with what you're saying here, @Kirill888.

I've been inspired by what Vladimir Agafonkin says about Leaflet. Last year he was asked what future developments Leaflet needs, and he basically said less features! https://dev.to/tomchadwin/comment/4158

Keep it simple...

woodcockr commented 5 years ago

@Kirill888 No, I do not and did not "discount the cost of existing solutions". To quote my earlier post "I agree the maintenance cost may not be worth it". In this regard I am in agreement with the proposal. I am concerned with building the ODC community and suddenly removing features people depend upon simply because I, with my limited view of the ODC usage, cannot see an existing feature in use is a sure fire way to destroy trust.

Presumably the intent is to remove this feature from the next release. Would it be difficult to add a Futures Warning to current code and then remove it at the next major release?

If nothing else it shows we care and are working on improving the deprecation process. Feel free to go cold turkey, the assumption of lack of use may well be correct. But in future I suggest we use a better method than assumption.

Kirill888 commented 5 years ago

What is a feature? Where is the boundary between internal implementation detail one should not depend on and a proper stable API that we should never change?

In the case of ODC this is all very "fuzzy". Apart for dc.load and cli tools for index manipulation what is a datacube API?

woodcockr commented 5 years ago

Those questions are exactly why an appropriate level architecture document is a good thing to have.

In this case though the question isn't related to API, the feature is behaviour related, the ODC does "this" soon it won't. If that matters to someone then it is a feature, regardless of API change.

Is there a problem with providing a "futures warning" in either this or future cases? Seems to work for the rest of the Python community.

Kirill888 commented 5 years ago

API is more than function signature and parameter names, it is also expected behaviour for situations like the one we discussing here. I have seen zero documentation that promised support for files with broken geospatial data. I wasn't even sure it existed until I started refactoring, by the way the internal name for a class that provides this is OverrideBandDataSource which is a completely misleading name since it does not allow for overriding, it only provides fallback.

Feature for one is bug for another. I'll give you an example: say a particular rasterio release ships with broken PROJ.4 and now all my valid data files start failing with "can't parse CRS". Do I really want things to continue to "just work" because we have a fallback? Except it's not "just work" it's "just kinda work but now with extra bias", because band footprint is not the same as dataset footprint".

This can happen, we use different GDAL to parse CRS in ODC than what rasterio ships with.

And yes it's totally reasonable to deprecate features slowly, but it is not without a cost.

Kirill888 commented 5 years ago

If we had a feature: "For explicitly marked files allow alternative geo-referencing to be supplied at indexing time", I would not suggest to remove that, I would just re-implement it in the new driver, and write tests for it. Sadly, this is not the feature we have. The feature we have is: "When failing to parse geo-referencing data in any file, don't report an error, but rather use approximate geo-referencing data. And, no, you can not opt out. Also you can not supply more accurate geo-referencing data either, so better make sure things don't fail that way."

Who is going to write detailed documentation for this "feature", when described for what it is, do you still want it? @woodcockr? Is this still a feature or is it a bug now?

woodcockr commented 5 years ago

@Kirill888 Feature or not, bug or not, documented or not, it is an existing behaviour in the ODC and it is about to be removed. I am only suggesting that it is better practice to provide a heads up that a change in behaviour is about to occur in case someone has a dependency on that behaviour, someone that we have no way of knowing about it because we do not know the actual usage patterns of all the users of the ODC.

As I have said previously, feel free to remove the feature due to the maintenance costs and the "oddness" of it (to paraphrase the other aspects of the thread). If there is a way to do this gently then do so because it is better to do that than to assume no one is impacted. This practice would prevent the type of problems we have seen previously where an assumption regarding use is made that impacts others adversely.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.