glue-viz / glue-geospatial

Experimental plugin to support satellite imagery
BSD 3-Clause "New" or "Revised" License

Identifier function false positives #12

Open astrofrog opened 8 years ago

astrofrog commented 8 years ago

At the moment, the is_geospatial function is recognizing pretty much any RGB image file:

In [1]: from glue_geospatial.data_factory import is_geospatial

In [2]: is_geospatial('random_image.png')
Out[2]: True

@robintw - I wonder whether there is some kind of meta-data we can look for that would identify files as being specifically geospatial data?

robintw commented 8 years ago

We could check whether a Coordinate Reference System (CRS) is defined, which should show whether it is spatial data or not.

Something like:

import rasterio

with rasterio.open(filename) as r:
    if not r.crs:
        # No CRS defined, so no spatial data
        is_spatial = False
    else:
        # A CRS is defined, so we have spatial data
        is_spatial = True

robintw commented 8 years ago

Actually, the problem with this is that various 'geospatial' datasets have no CRS defined. This can happen for various reasons, including lazy programmers (e.g. some of my test algorithms don't propagate the CRS info between files properly), processing errors, or a deliberate choice not to provide georeferencing information in the file itself (sometimes it is provided in a separate metadata file, for some unknown reason).

Is it a particular issue if random_image.png is picked up by this DataFactory? Would you prefer that it wasn't picked up at all, or that it was picked up by another factory?

astrofrog commented 8 years ago

@robintw - if you have a random PNG file (say of a cat), then the main difference between the current RGB data factory and the geospatial one is the names of the components: Red, Green, and Blue for the former, and Band 1, Band 2, and Band 3 for the latter.

Is there a limited set of extensions used for geospatial data, or are JPEG and PNG also used, for instance?

I guess we just need to decide on the priority of the data factories - we could for instance give the generic RGB reader priority if and only if no metadata is present in the RGB file. But in this case, would you still want the components named Band 1, Band 2, Band 3?

robintw commented 8 years ago

Yes, we can probably do this based on extensions: satellite data are never (to my knowledge) in JPG or PNG. Some are, however, in JPEG2000 (extension .jp2).
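To make the extension idea concrete, here is a minimal sketch. The function names and the extension set are assumptions for illustration only; they are not taken from glue_geospatial, and the set would need extending if other ordinary image formats turn up.

```python
import os

# Assumption: only the extensions named in this thread are excluded.
# This is illustrative, not an exhaustive list of non-geospatial formats.
NON_GEOSPATIAL_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def has_plain_image_extension(filename):
    """Return True if the extension marks an ordinary (non-satellite) image."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in NON_GEOSPATIAL_EXTENSIONS


def could_be_geospatial(filename):
    """Hypothetical pre-check that looks only at the file name."""
    return not has_plain_image_extension(filename)
```

With this, could_be_geospatial('random_image.png') would be False, while a .jp2 or .tif file would still be passed on to rasterio for a closer look.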

I have no particular preferences about standard RGB data: probably Red, Green and Blue are better as names of components for them.
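A rough sketch of how that naming choice could look, assuming we already know whether any geospatial metadata was found. The helper name and its arguments are hypothetical and not part of glue.

```python
def component_labels(n_bands, has_metadata):
    """Pick component names for an image with n_bands bands.

    Assumption: a three-band file without geospatial metadata is treated
    as an ordinary RGB image; everything else gets generic band names.
    """
    if not has_metadata and n_bands == 3:
        return ["Red", "Green", "Blue"]
    return ["Band {0}".format(i + 1) for i in range(n_bands)]
```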

How do we set the priorities for DataFactories? Is it a single static constant for each factory, or can it change as you get more information (e.g. we try getting metadata using rasterio, and if we can't find any then we decrease the relative priority of the geospatial reader)?

astrofrog commented 8 years ago

@robintw - the priority is set by an argument in the @data_factory decorator:

https://github.com/glue-viz/glue/blob/master/glue/core/data_factories/hdf5.py#L41

I hadn't thought of having the identifier return the priority - that would be even better, since it would allow more fine tuning as you suggest.
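As a sketch of what an identifier that returns a priority might look like: everything below is hypothetical. The priority values are placeholders, and glue's identifiers currently return True or False, so pick_factory only stands in for whatever selection logic glue would use.

```python
# Hypothetical priorities; the numbers are placeholders, not values used by glue.
GEOSPATIAL_PRIORITY = 100
FALLBACK_PRIORITY = 0


def geospatial_priority(has_crs):
    """Return a priority instead of a plain True/False.

    has_crs: whether a CRS was found in the file's metadata.
    """
    return GEOSPATIAL_PRIORITY if has_crs else FALLBACK_PRIORITY


def pick_factory(candidates):
    """Pick the label with the highest priority from (label, priority) pairs."""
    label, _ = max(candidates, key=lambda pair: pair[1])
    return label
```

If the generic RGB reader had a fixed priority somewhere between the two values, a file with a CRS would go to the geospatial reader and a plain PNG would fall back to the RGB one.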