mllam / weather-model-graphs

Tooling for creating, visualising and storing data-driven weather-model graphs
https://join.slack.com/t/ml-lam/shared_invite/zt-2t112zvm8-Vt6aBvhX7nYa6Kbj_LkCBQ
9 stars 9 forks source link

Allow non-regularly gridded and lat-lon coordinates #32

Open joeloskarsson opened 1 month ago

joeloskarsson commented 1 month ago

Describe your changes

This PR allows for more flexibility in the coordinates used to create graphs. They do no longer need to be on a regular $N_x \times N_y$ grid, but can be at arbitrary positions. This means that the main coordinate input no longer has the shape [2, Ny, Nx], but instead [N_grid_points, 2] with each column representing a coordinate for all grid points. This makes it easy to also allow for lat-lon input coordinates, and use a specified projection to map these to a euclidean space.

To get an overview of the lat-lon functionality I recommend checking out the new page I added to the documentation.

The motivation for this change is to allow more flexibility in the regions graphs are built for. This is in particular useful when we consider grid points as the union of some LAM-region and a boundary of global grid-points, possibly on different gridding. With this change such cases can be handled without issues.

There are a few noteworthy changes that comes with this feature:

New dependencies

This PR introduces a dependency to cartopy, as that is needed for the projection specification when working with lat-lons.

Merge and release planning

Issue Link

This PR closes #29

Type of change

Checklist before requesting a review

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

Author checklist after completed review

Checklist for assignee

joeloskarsson commented 1 month ago

This is now ready for review :partying_face:

joeloskarsson commented 1 month ago

In Changelog.md also mention that mesh_node_distance is the new name of grid_refinement_factor.

Yes, that's a good idea. I haven't added any changelog entry yet since I think that we should probably create a new release before merging this, and then the changelog will be updated anyhow and this entry go as the first under unreleased.

Is Euclidean coordinates or Cartesian coordinates easier to understand? Maybe Euclidean is more accurate regardless...?

Thinking a bit more about this (and reminding myself about the exact definitions), I actually think that we should use Cartesian rather than Euclidean everywhere. My reasoning is that cartesian refers to a coordinate system, whereas euclidean refers to a type of geometric space (with arbitrary coordinate system, e.g. could be spherical coordinates). Now, the surface of the earth is not a euclidean space and lat-long not cartesian coordinates, so there's no real risk of substantial misunderstandings, but I guess "Cartesian coordinates" makes sense whereas "Euclidean coordinates" is technically ambiguous.

@sadamov If you agree with this reasoning I would go ahead and change all comments and names to use Cartesian :smile:

These lat-lon based graphs would benefit a lot from heterogeneous graph design it seems (i.e. not living on a equidistant grid of nodes). This is a topic for a future PR though, right?

Yes, they would definitely benefit a lot from that. But I agree that that's for a future PR. One simple thing I was trying out was to just remove any mesh nodes that does not have any grid node connections. One could do that for the flat Keisler-like graphs, but I realized that it becomes tricky for the multiscale(?) and hierarchical graphs. In order to now have different behavior for the different archetypes I left that out now. Better to revisit and think a bit more about when it is clearly needed.

joeloskarsson commented 4 days ago

Note: This PR should have a review by @leifdenby , as by the roadmap. Then it is ready to merge.

leifdenby commented 2 days ago

I'm in the process of giving this a thorough read through and executing the notebooks etc as I go through. In general this is really excellent (as usual), and I fully agree with all of the following three changes:

  • I broke out the flat graph creation into a function create_flat_singlescale_mesh_graph, as it made more sense that also this type of mesh graph had its own function. This made sense to do here as the amount of related code was growing.

  • I also moved some utility functions around the testing (for creating dummy coordinate arrays) into their own tests/utils.py file. This made sense as these functions were duplicated in multiple files before and I had to change them with this new coordinate setup.

  • The argument grid_refinement_factor was earlier based on the number of grid nodes in x- and y-directions. As we do no longer assume that these grid nodes lay in regular rows and columns this definition will no longer work. Instead we will have to define the grid refinement in terms of "mesh nodes / distance", using the actual coordinate values. The argument is also being renamed to mesh_node_distance to match this.

The following I also really like, but I have a question/suggestion wrt it:

  • In the high-level functions that are used to create graphs the xy argument is renamed to coords, as this can be lat-lons or euclidean x- and y-coordinates. If a projection is given this is assumed to be lat-lons, otherwise euclidean. For functions further down (e.g. create_single_level_2d_mesh_graph), that still take only euclidean coordinates, this argument is still called xy.

As I am reading it the intended use in your new implementation has two main use cases (in terms of how the coordinates are defined):

# 1. Using the coordinate values as-is, measuring distance in units of
# coordinate values with the euclidean distance metric. Relevant when working
# with cartesian coordinates.
graph = wmg.create.archetype.create_keisler_graph(
    coords,
    mesh_node_distance=10,  # distance in units of coordinate values, measured as euclidean distance
    projection=None,
)

# 2. First projecting the coordinates to a different coordinate system,
# measuring distance in units of the projection with the euclidean distance
# metric. Relevant when working with lat/lon coordinates.
graph = wmg.create.archetype.create_keisler_graph(
    coords,
    mesh_node_distance=1.0e6,  # distance in units of projected coordinates, measured as euclidean distance
    projection=ae_projection
)

with the current function signature being:

def create_keisler_graph(
    coords: np.ndarray,
    mesh_node_distance: float,
    projection: Optional[cartopy.crs.CRSj] = None,
) -> nx.Graph:

Where it is assumed that distance is measured with the euclidean distance metric, with either the coordinate values as-is or first projecting.

I think it would be cool to extend this in future by allowing for a different distance metric to be used. That would allow us to create graphs from lat/lon coordinates directly (and thereby have graphs that fully wrap the Earth).

For example we could then do:

# 1. Using the coordinate values as-is, measuring distance in units of
# coordinate values with the euclidean distance metric. This would be useful
# for creating graphs from cartesian coordinates.
graph = wmg.create.archetype.create_keisler_graph(
    coords,
    mesh_node_distance=10,
    projection=None,
    distance_metric="euclidean"
)

# 2. First projecting the coordinates to a different coordinate system,
# measuring distance in units of the projection with the euclidean distance
# metric. This would be useful for creating graphs from lat/lons for limited
# areas.
graph = wmg.create.archetype.create_keisler_graph(
    coords,
    mesh_node_distance=mesh_distance,
    projection=ae_projection,
    distance_metric="euclidean"
)

# 3. Using the coordinate values as-is, measuring distance in great circle
# distance with the haversine distance metric. This would be useful for
# creating graphs from lat/lon coordinates directly. We should default to the
# distance being in SI units (as in meters) and so use a fixed Earth radius in
# meters.
graph = wmg.create.archetype.create_keisler_graph(
    coords,
    mesh_node_distance=1.0e4,  # distance of 10 km between coordinates given in lat/lon
    projection=None,
    distance_metric="haversine"
)

With the function signature changed to:

def create_keisler_graph(
    coords: np.ndarray,
    mesh_node_distance: float,
    projection: Optional[cartopy.crs.CRS] = None,
    distance_metric: str = "euclidean"
) -> nx.Graph:

So that by default the function will use euclidean distance with the coordinate values, but the metric can be changed to haversine (which gives approximate great circle distances in meters between lat/lon coordinates, assuming a fixed Earth radius).

All that is then required is to pass the distance_metric argument down the function calls and where the actual distance is calculated and used implement the logic for the different distance metrics.

What do you think of this idea? If it sounds good to you then maybe we could go ahead and add distance_metric as an argument to the function (with default value euclidean) to make the metric used explicit, and then that would make it easy to add the haversine distance metric in the future.

Of course one could argue that it should be possible to provide the Earth radius, but maybe that could become an optional metric_argument down the line. Or we allow people to optionally provide a projection that makes it clear that the coordinates are lat/lon and in that projection object they can then set the globe radius (e.g. with cartopy.crs.PlateCarree(globe=cartopy.crs.Globe(semimajor_axis=..., semiminor_axis=...)) or something like that)