pytroll / pyresample

Geospatial image resampling in Python
http://pyresample.readthedocs.org
GNU Lesser General Public License v3.0
344 stars 95 forks source link

Refactor for boundary lat/lon extraction #546

Open ghiggi opened 10 months ago

ghiggi commented 10 months ago

This PR introduces a consistent interface for dealing with area boundaries in pyresample. It simplify the codebase and will enable to significantly simplify the code of geometry.py towards pyresample 2.0

The PR arises from my understanding that users and developers often needs:

The user need to be able to retrieve such coordinates in the geographic space (available for all areas) or, for specific cases (AreaDefinition), in the projection coordinates. A simple and easy to use interface for that is currently missing.

Currently we had methods which returned tuple (x,y) or sides with unconsistent behaviour regarding:

With this PR, I aim to make all boundary related operations consistent across a unique interface. To this end I introduced to classes: SphericalBoundary and PlanarBoundary:

To retrieve AreaDefinition boundary projection coordinates, I added the area.projection_boundary(). This method returns a SphericalBoundary if the CRS is geographic, otherwise a PlanarBoundary class. To retrieve the SphericalBoundary, SwathDefinition and AreaDefinition share the area.boundary() method.

To deal with boundary geographical coordinates, 4 different boundary classes existed: AreaBoundary, AreaDefBoundary, SimpleBoundary, Boundary. In the PR, I introduced the SphericalBoundary which inherits methods from AreaBoundary (to guarantee backward compatibility), and I deprecated the use of the other existing boundary classes. The SphericalBoundary can be retrieved using area.boundary().

Here below I document the deprecations and the proposed replacement:

area.get_boundary_lonlats() --> DEPRECATED 
area.get_edge_lonlats()  -->  area.boundary().contour() 
get_geostationary_bounding_box_in_lonlats(area) -->  area.boundary().contour() 

area.get_edge_bbox_in_projection_coordinates() --> self.projection_boundary().contour()
get_geostationary_bounding_box_in_proj_coords(area) --> area.projection_boundary().contour() 

area.get_bbox_lonlats() -->  area.boundary().sides
area._get_geo_boundary_sides() --> area.boundary().sides

<previously missing> -->  area.projection_boundary().sides

For the SphericalBoundary and PlanarBoundary there are the following common methods:

The SphericalBoundary has the unique properties:

The PlanarBoundary has the unique properties:

This PR introduces the following backward-incompatibilities:

Ongoing bug in get_geostationary_bounding_box_in_proj_coords(nb_points)

 import xarray as xr
from pyproj import CRS, Proj
from shapely.geometry import Polygon
import pyresample
from pyresample import geo_filter, parse_area_file
from pyresample import AreaDefinition
from pyresample.geometry import _get_geostationary_bounding_box_in_lonlats, get_full_geostationary_bounding_box_in_proj_coords

def create_test_area(crs, width, height, area_extent, **kwargs):
    """Create an AreaDefinition object for testing."""
    args = (crs, (height, width), area_extent)
    args = (crs, width, height, area_extent)
    attrs = kwargs.pop("attrs", {})
    area_id = attrs.pop("name", "test_area")
    args = (area_id, "", "") + args
    area = AreaDefinition(*args, **kwargs)
    return area

def truncated_geos_area():
    """Create a truncated geostationary area."""
    projection = {'a': '6378169', 'h': '35785831', 'lon_0': '9.5', 'no_defs': 'None', 'proj': 'geos',
                  'rf': '295.488065897014', 'type': 'crs', 'units': 'm', 'x_0': '0', 'y_0': '0'}
    area_extent = (5567248.0742, 5570248.4773, -5570248.4773, 1393687.2705)
    width = 3712
    height = 1392
    geos_area = create_test_area(projection, width, height, area_extent)
    return geos_area

geos_area = truncated_geos_area()
self = geos_area
vertices_per_side = 20

lons, lats = _get_geostationary_bounding_box_in_lonlats(self, nb_points=vertices_per_side)
lons.shape # 11 points ... not 20 !
lats.shape # 11 points

lons, lats = _get_geostationary_bounding_box_in_lonlats(self, nb_points=50)
lons.shape # 22 points  ... not 50 !
lats.shape # 22 points 

# The problem arise in get_geostationary_bounding_box_in_proj_coords
x, y = _get_geostationary_bounding_box_in_lonlats(self, 50)
x.shape # 22 ... ... not 50 !
geo_p = Polygon(np.vstack((x, y)).T)

# The nb_points is just used to define the nb_points of the full disc boundary coords
x, y = get_full_geostationary_bounding_box_in_proj_coords(self, 50)
x.shape # 50 
geo_full = Polygon(np.vstack((x, y)).T)

# This is what happen within get_geostationary_bounding_box_in_proj_coords
x, y = get_full_geostationary_bounding_box_in_proj_coords(self, 50)
ll_x, ll_y, ur_x, ur_y = geos_area.area_extent

geo_bbox = Polygon(np.vstack((x, y)).T)
area_bbox = Polygon(((ll_x, ll_y), (ll_x, ur_y), (ur_x, ur_y), (ur_x, ll_y)))
intersection = area_bbox.intersection(geo_bbox)
try:
    x, y = intersection.boundary.xy  # here random number of points (not granted that is even !)
except NotImplementedError:
    return [], []
return np.asanyarray(x[:-1]), np.asanyarray(y[:-1])
ghiggi commented 10 months ago

@mraspaud this PR is ready for review. There are 3 test units failing related the gradient.__init__ code ... but I can't manage to find the root cause because of the complexity of the code. Can you start to review it and try to see what is going on? Or alternatively, can you point me on how you debug such test units interactively? Thanks in advance !

ghiggi commented 10 months ago

I try to answer to all the elements of discussion point by point here below.

Geostationary x/y sides

get_geo_<utility> functions

To be safe, I will keep the geostationary utility function as public. However, I checked ... and satpy, pyspectral and the pyorbital packages does not use any of these functions. In satpy, in the satpy/readers/utils.py there a some utility to extract specialized geostationary area information from AreaDefinition objects. Some functions looks quite old code like get_geostationary_bounding_box, _lonlat_from_geos_angle. Maybe could be useful, in a future PR, to move all geostationary utilities in a single separate module in pyresample.

Future Boundary Interface

Boundary Sides Consistency

get_boundary_lonlats and SimpleBoundary usage

Conclusion Lot of stuffs to digest. I guess any of these points can be addressed in a separate PR 😄 In meanwhile, I will take care of your review comments. Let me know your opinion for the bug (and MRE) provided in the PR description, and then let’s try to understand the cause of the failing test in gradient.py; I struggle to debug such test units ... and I look forward for your feedbacks 😄

djhoese commented 10 months ago

If we want to have _lonlats and _proj_coords functions, it means there we would need a boundary class for AreaDefinition and one for SwathDefinition (without _proj_coords) objects.

I'm not so sure. I think it is one of the failings of our past understandings that all latitude/longitude degrees are the same. That is just not true. There can be prime meridian shifts or different Earth models/datums used. So there is some idea of "native" CRS and a preferred geographic/geodetic CRS. I could easily see a helper object (as I mentioned in my previous comment) that gets you x or y or lon or lat, but allows you to convert that object to another version with a different preferred geodetic CRS.

As a personal a taste, I find more readable code without too much get. get … and I guess most of relevant stuffs can be defined as properties. Regarding the chunks option, are you sure that is needed when retrieving the lon/lat edges? One can always .rechunk or dask.array.from(array, chunks)

Is it the get that you don't like? Or the many small methods/functions? I think hiding some of this functionality in helper classes makes it irrelevant. I also think in a lot of cases we could get rid of the get, but I'm fearful of turning most things into properties ever since we started playing with dask/xarray. With dask some of these methods end up being able to produce dask arrays and there is a real benefit performance-wise of knowing/defining what that chunk size is from the creation point versus re-chunking after.

I'm sure there are some use cases where we define polygons or boundaries where this idea of needing to define dask arrays is unnecessary, but I'd want to be careful. In a lot of cases I don't think I'd want to generate a polygon and then not immediately use it. However, that polygon may be generated from large lon/lat arrays (this is the worst case scenario I think of all the scenarios) and in that case there may be a benefit to computing the resulting polygon at the same time as all other computations (if possible) to reuse the compute lon/lat arrays (dask would share the result of the tasks). But that being the worst case maybe we just "deal with it". Lon/lat arrays are already the slowest form of computation we use.

However, I noticed that when we extract a boundary, if the original sides are not clockwise, and we force_clockwise=True, then the side position currently returned by get_bbox_lonlats become [left, bottom, right, top] (see here)I think this should be fixed ... as it's prone to introduce bugs (I don't tell you the time wasted to discover this point when implementing the spherical tools for my projects ... ).

I'm not sure I agree. "top" and "bottom" are relative. Is the top of the swath the last scan of the data or the first? Is it the most northern? What if we have a non-polar orbiting satellite? If the sides are always generated from a polygon maybe that will always be more clear (shapely or other library determines what that means)?

I believe that the clockwise/anticlockwise reordering should be applied only when returning the contours/vertices/lons/lats in the boundary object and not at the sides list definition in get_bbox_lonlats !

I'd have to check how I used this method in my Polar2Grid project. It is likely that I put it in get_bbox_lonlats because that is used internally in the area boundary class (if I recall correctly). But also, other polygon libraries probably need clockwise coordinates.

We could add to AreaBoundary the method set_clockwise() and set_anticlockwise(), which regulates the order of the returned contours/vertices/lons/lats arrays.

I don't think it is that people need the counter-clockwise (CCW) version of the coordinates, it is that the polygon/boundary math requires the clockwise (CW) version. Besides wasted performance I don't think I see a reason to never not return CW.

it’s not always clear to me when I see such name, what is the inner structure of the object.

Yep. Could probably return a namedtuple at the very least. This was just old interfaces before anyone knew what they were doing I think.

...Ah I just read your BoundarySides idea. Same thing. If it is a full class instead of a namedtuple then I might suggest we try using __slots__ to keep the "weight" of the class small.

Conclusion

I think in general I like where you're going, but I still think we're 1-2 steps away from a design that really "wows" me. I think the idea of something like an area_def.boundary.left.lonlats is exciting as it opens up a usage pattern that works for other use cases too.

djhoese commented 10 months ago

For this failing test:

https://github.com/pytroll/pyresample/blob/c0025ed8905cb8d026de4b3a48451346b802b36c/pyresample/test/test_gradient.py#L246-L254

Can you print out self.swath_resampler.coverage_status? My guess is this method is filtering out all chunks of the source data:

https://github.com/pytroll/pyresample/blob/c0025ed8905cb8d026de4b3a48451346b802b36c/pyresample/gradient/__init__.py#L223-L235

And then the list passed to np.max is empty.

That coverage_status is filled here:

https://github.com/pytroll/pyresample/blob/c0025ed8905cb8d026de4b3a48451346b802b36c/pyresample/gradient/__init__.py#L164-L211

But I'm just using what I get from reading the code. I didn't code this so I could be completely missing something.

ghiggi commented 10 months ago

Just to better clarify a couple of thoughts @djhoese :

I'm not sure I agree. "top" and "bottom" are relative. Is the top of the swath the last scan of the data or the first? Is it the most northern? What if we have a non-polar orbiting satellite? If the sides are always generated from a polygon maybe that will always be more clear (shapely or other library determines what that means)?

I don’t refer to the orientation in terms of geographic space, but with respect to the 2D lon/lat arrays. The [ŧop, right,bottom, left] sides of the coordinate arrays.

I'd have to check how I used this method in my Polar2Grid project. It is likely that I put it in get_bbox_lonlats because that is used internally in the area boundary class (if I recall correctly). But also, other polygon libraries probably need clockwise coordinates.

Yes the AreaBoundary class (at least in pyresample) does not correct for clockwise/anticlockwise ordering. The force_clockwise reording was applied in get_bbox_lonlats because the outputs of this function were concatenated to get directly the lon/lat boundary vertices with the correct orientation. But if we speak of the sides coordinate of an array, clockwise/anticlockwise order make no sense. Is the vertices derived from the sides that must follow the clockwise/anticlockwise assumption. Right now, doing force_clockwise=True for i.e. retrograde orbits (going from east to west) scanning forward (or prograde orbits scanning backward), cause the output sides to be the [left, bottom, right, top] sides of the coordinates/image arrays.

I don't think it is that people need the counter-clockwise (CCW) version of the coordinates, it is that the polygon/boundary math requires the clockwise (CW) version. Besides wasted performance I don't think I see a reason to never not return CW.

If one want to generate Shapely polygons, it requires CCW order. If the area wraps the pole, we must not modify the ordering because the order determines which pole is wrapped in.

djhoese commented 10 months ago

Oh...so our boundary logic (or spherical geometry stuff) needs CW and shapely uses CCW? Ok so there should be an option.

So would/could/should there be a "sides" method on a boundary/polygon or should it be on the area? If there is a "sides" method on the area (and swath) then I think the equivalent of a force_clockwise needs to be on that method, right? Otherwise what if someone wants to do their own polygon math without shapely?

djhoese commented 9 months ago

Ok @ghiggi I've merged this with main which includes my pre-commit changes. There was only one conflicting file on the merge and that was in geometry.py. The biggest conflict was actually just unfortunate where I moved the "edge" method from BaseDefinition to AreaDefinition right under the boundary sides methods and git was confused when it tried to merge them even though they were separate. Otherwise, the area slicing methods/functions are now in _subset.py including _get_area_boundary which you had made a change to and I made sure that was migrated over to that.

Otherwise, my merge just includes a couple things to make pre-commit happy (docstring fixes, indentation/spacing fixes, unused import after it included your changes, etc).

Let me know if you have any problems with this or anything looks confusing. I should be online at my usual times tomorrow.

codecov[bot] commented 9 months ago

Codecov Report

Attention: 86 lines in your changes are missing coverage. Please review.

Comparison is base (25d0c72) 94.08% compared to head (e638557) 93.72%.

Files Patch % Lines
pyresample/geometry.py 77.93% 32 Missing :warning:
pyresample/boundary/planar_boundary.py 70.83% 14 Missing :warning:
pyresample/test/test_geometry/test_area.py 92.00% 12 Missing :warning:
pyresample/boundary/spherical_boundary.py 90.42% 9 Missing :warning:
pyresample/boundary/base_boundary.py 92.42% 5 Missing :warning:
pyresample/boundary/simple_boundary.py 55.55% 4 Missing :warning:
pyresample/boundary/area_boundary.py 40.00% 3 Missing :warning:
pyresample/bilinear/_base.py 66.66% 2 Missing :warning:
pyresample/kd_tree.py 81.81% 2 Missing :warning:
pyresample/gradient/__init__.py 95.45% 1 Missing :warning:
... and 2 more
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #546 +/- ## ========================================== - Coverage 94.08% 93.72% -0.36% ========================================== Files 84 98 +14 Lines 13228 13971 +743 ========================================== + Hits 12446 13095 +649 - Misses 782 876 +94 ``` | [Flag](https://app.codecov.io/gh/pytroll/pyresample/pull/546/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pytroll) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/pytroll/pyresample/pull/546/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pytroll) | `93.72% <91.56%> (-0.36%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pytroll#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

coveralls commented 9 months ago

Coverage Status

coverage: 93.34% (-0.3%) from 93.671% when pulling e638557384845d7e14f2703bc19bfd372d42414b on ghiggi:refactor-get_bbox-lonlat into 25d0c7209f15fd57928fbafccbdaa0060530d6ae on pytroll:main.

djhoese commented 9 months ago

Very nice job. I re-read your updated PR description and I really like the goal you've set for yourself. It makes a lot of sense and I'm glad someone is trying to tackle the problems and inconsistencies.

I think my general overall concern with the changes you've made so far as that I don't like the separation between projections and geographic versions of things. I will admit I may have a misunderstanding of this, but I don't think so. To me, geographic "projections" are not a single thing. They can differ with their model of the Earth and the datum/reference coordinates used for that Earth. As you know we could also shift the prime meridian of a lon/lat CRS and it is still considered geographic.

With the above in mind, I think I'd like to look at combining the different boundary classes into a single class. Another thing that leads me to this idea is that you point out that the two classes differ in their .x/.lons, .y/.lats, and .sides_X properties. But with my understanding in mind .x == .lons and the same for the other properties. And if the CRS becomes a required parameter maybe that's OK?

I'm also not a fan of the "clockwise" behavior of the boundary classes. Specifically the .set_clockwise() and .set_counterclockwise() methods. These seem like they should either:

a. Return a new instance of the boundary with this behavior enforced/applied. b. Make it a keyword argument to the class maybe? c. Leave it as-is maybe?

It just feels odd to me, but I'm not sure I have any better suggestions at this point.

Wild suggestion: Are there enough deprecations and changes here that this should be moved to a single .boundary() method on the future geometry definitions?

ghiggi commented 9 months ago

@djhoese I also don't like the separation between projections and geographic versions of things, but it seems to me required when dealing with boundaries.

Here we are not speaking of geographical or projected CRS , but the presence of two types of coordinates: spherical and planar ! Maybe we should call the classes SphericalBoundary and PlanarBoundary !

For SwathDefinition, we can only define a SphericalBoundary. For AreaDefinition with CRS with planar coordinates, we can define SphericalBoundary and PlanarBoundary. For AreaDefinition with CRS with geographic coordinates, we can only define SphericalBoundary.

I try to elaborate here below some additional thoughts:

Side note: there was already a method boundary()I created 2-3 months ago that returned an AreaBoundary (geographic coordinates) that now I deprecated in favour of geographic_boundary which is returning the GeographicBoundary. Maybe we can keep the .boundary(), make it return GeographicBoundary (instead of AreaBoundary) and remove the proposed geographic_boundary method if you are willing to accept that a method return a class with a different name (GeographicBoundary has all methods of AreaBoundary)

djhoese commented 9 months ago

Ah the Spherical versus Planar makes this make more sense now.

For an AreaDefinition, when extracting the boundary sides projection coordinates (i.e. in meter) and the equivalent lat/lon coordinates on the sphere I might end up with the case where: the concatenation of the area sides with the projection coordinates leads to clockwise vertices the concatenation of the area sides with the geographic coordinates leads to counterclockwise vertices --> From there one of the reason to define 2 separate objects

I don't understand why the type of coordinate retrieved (projection or geographic) would effect clockwise or counterclockwise. Don't we have enough cases of swaths (ascending and descending orbits or other old instrument types) and areas (north-up areas, south-up like FCI, and south-up/east-left like SEVIRI) where any coordinate combination could give you any orientation/ordering?

Besides the cases of what data we started with, what are cases where we need to use spherical boundaries even if we have a projected/planar area? Also, are there cases where we have geographic boundaries but can assume they don't cross the anti-meridian and could therefore be treated as planar coordinates? For example, a mercator or eqc projected dataset is not technically valid if it crosses the anti-meridian (it goes outside the coordinate space of the projection) so there is no need to treat it as a sphere. Similarly, if it does cross the anti-meridian in lon/lat space, but we shift the prime meridian we could (in some cases that don't cross the poles) treat the lon/lat coordinates as planar too.

I think I've realized one of the main reasons I don't like set_clockwise is that it puts state into the object and makes it mutable. It really feels like these objects should be immutable and only be containers for the boundary information and not modifiers of it. I realize that the clockwise wish property is not modifying the data until it is returned, but it causes confusion when using it. I would almost feel better if the boundary object knew the order/orientation of the data (either told by the user as a kwarg or determined by the class when needed) and besides automatically converting for things like shapely object creation, it could probably methods for forcing it that return a new copy of the boundary with the modifications done already. If the orientation/order is the same as the Boundary already then return self. If not, convert and return a new object.

This also makes me think, could the pyresample spherical stuff be made to detect the closed/not-closed state of the polygon provided to it and adjust as needed? That way there doesn't need to be special handling...eh I guess this is a nice convenience for any user to have that ability to get closed or open boundary coordinates.

ghiggi commented 9 months ago

Thansk for all these discussions @djhoese. They also help me to completely free out my thoughts. Answering to your questions:

Besides the cases of what data we started with, what are cases where we need to use spherical boundaries even if we have a projected/planar area?

All pykdtree based stuffs currently exploit spherical boundaries for data reduction (see contour_poly in get_area_slices) ! Only gradient use planar boundaries (and fails then in many cases)

Also, are there cases where we have geographic boundaries but can assume they don't cross the anti-meridian and could therefore be treated as planar coordinates?

Yes many cases. With this PR, we will be in a good place to perform data reduction, on the sphere or with shapely, based on the type of source and target area. We need to perform operation on the sphere if we have:

And we need to stay on the sphere if the two areas CRS bounds does not intersect !

I think I've realized one of the main reasons I don't like set_clockwise is that it puts state into the object and makes it mutable.

I agree with that and I strongly push to deprecate the existing use of force_clockwise. The order should just be set when creating the shapely Polygon or the spherical Polygon depending on the software requirements. In this PR, if set_*clockwise is not called, it returns the boundary vertices as they appear in the coordinate array (top, right, bottom, left)

This also makes me think, could the pyresample spherical stuff be made to detect the closed/not-closed state of the polygon provided to it and adjust as needed?

That's just requires to check if the first vertex is equal to the last

djhoese commented 9 months ago

We mentioned this on slack, but if backwards compatibility is limiting this design at all then I am very in favor of setting up as a boundary method in the future geometries that use this "perfect" boundary class (or classes if needed). The existing/legacy/old geometry definitions could use a special subclass of these future boundary classes that does something like:

from .future.boundary import Boundary as FutureBoundary

class _OldBoundary:
    ...

class LegacyBoundary(FutureBoundary, _OldBoundary):
    ...

That is, the boundary used by the existing AreaDefinition would be LegacyBoundary which is a combination of the FutureBoundary and the old boundary that you didn't want to hold on to just for backwards compatibility. That way when we switch to Pyresample 2.0 we get the "best" design instead of having to still deal with workarounds and hacks just to deal with backwards compatibility.

All pykdtree based stuffs currently exploit spherical boundaries for data reduction (see contour_poly in get_area_slices) !

For some reason I was sure that the spherical stuff was doing stuff in XYZ geocentric coordinate space, but you're right it is all lon/lat. Interestingly the SphPolygon computes the X/Y/Z coordinates but then doesn't ever use them from what I can tell:

https://github.com/pytroll/pyresample/blob/25d0c7209f15fd57928fbafccbdaa0060530d6ae/pyresample/spherical.py#L547-L552

Lastly, about inheritance versus composition/dependency injection: These classes seem very similar and almost not different at all. Unless you have a strong argument against it, it really seems better to require a crs kwarg associated with the provided boundary coordinates/vertices and then do a few if statements to separate out the if self.crs.is_geographic logic when needed. Otherwise, a "composition" style implementation would be something like doing:

class Boundary:
    def __init__(self, ...):
        if self.crs.is_geographic:
            self._polygon_helper = GeographicPolygonHelper(self.vertices, self.crs)
        else:
            self._polygon_helper = PlanarPolygonHelper(self.vertices, self.crs)

    def to_shapely_polygon():
        return self._polygon_helper.to_shapely_polygon()

Now of course I haven't done all of the necessary checks about how the code differs, but I wanted to give an example of what I was thinking/talking about when mentioning these topics.

I'd also like to rethink this idea of passing the type of polygon class to generate with a string keyword argument. There are a lot of design patterns that exist that could help us with this, but at the moment I don't have time to dive into them. This could easily be one of the last decisions made about this PR so not something we need to do right now.

ghiggi commented 9 months ago

The PR is now ready for review @djhoese @mraspaud. I again updated the PR description with the last changes. I finished to implement the interfaces and the code now it ensure the correct ordering of vertices for SwathDefinition and AreaDefinition, which in turn enable to correct creation of SphPolygon and shapely Polygon. I also added code to retrieve boundary of AreaDefinition projections with infinite values at the edges because of out-of-Earth locations (i.e. polar projections, global projections like Robinson, Mollweide, etc).

I now wait for your feedbacks before implementing the last test units. I also plan to still add a commit to enable to pass a tuple of vertices_per_side.