Add a unique identifier for each grid cell (geocoding)

The motivation

A primary feature of the web interface is to be able to reference and visit individual grid cells. By creating a unique identifier via a geocoding scheme for each grid cell, humans will be able to associate an identifier with an important location.

Examples

Grid cell tiles

When viewing a collection of grid cells, each has an identifier (N63.326 in this example):

Viewing a grid cell on the map

When viewing the map, individual grid cells can be selected. The cell identifier is displayed near the top of the left-hand panel (V32.620 in this example), and we can imagine the URL of the view will include the same identifier, like openmethane.org/map/X-123-456:

Why generate in the prior?

Q: Since the cell id is a user interface tool, why should we generate it in the prior? It could be generated by the web interface when the grid data is ingested.

A: The prior is responsible for generating the grid itself, so if each cell needs a unique id, this seems like the logical place to generate it. It will also help us correlate user feedback from the web interface, which will use visible identifiers to denote grid cells, with locations in the model, without referencing the database of the web project.

Identifier requirements

The system to generate identifiers should make ids that are:

as short as possible
URL-safe (i.e. avoid special characters apart from '-')
unlikely to form accidental words or obscenities
able to support the 10km grid and extend to 1km or possibly other grid resolutions

The proposed solution

I propose a three-part scheme: {GRID_ID}.{CELL_COLUMN}.{CELL_ROW}

GRID_ID - a unique identifier for the grid scheme.
- For the 10km grid this might have a value of 10
- For the 1km grid this might have a value of 1
- Although 10 and 1 have a semantic meaning, we could use any non-semantic values for future schemes
CELL_COLUMN and CELL_ROW
- a simple index value from 0 to n where n is the last column/row
- the grid spans ~6321km east-west, so column values of 0..633 (10km) or 0..6321 (1km)
- the grid spans ~4485km north-south, so row values of 0..449 (10km) or 0..4485 (1km)
Examples:
- 10.0.0 through 10.633.449
- 1.0.0 through 1.6321.4485
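
To make the three-part scheme concrete, here is a minimal Python sketch of formatting and parsing such ids. The names `CellId`, `format_cell_id` and `parse_cell_id` are hypothetical, and the validation rules (alphanumeric grid id, non-negative decimal indices) are my assumptions rather than anything agreed above.

```python
import re
from typing import NamedTuple

# Assumed shape: an alphanumeric grid id followed by two decimal indices.
_CELL_ID_PATTERN = re.compile(r"([A-Za-z0-9]+)\.([0-9]+)\.([0-9]+)")


class CellId(NamedTuple):
    grid_id: str  # identifies the grid scheme, e.g. "10" or "1"
    column: int   # zero-based column index
    row: int      # zero-based row index


def format_cell_id(cell: CellId) -> str:
    """Join the three parts with '.', e.g. CellId("10", 0, 0) -> "10.0.0"."""
    return f"{cell.grid_id}.{cell.column}.{cell.row}"


def parse_cell_id(text: str) -> CellId:
    """Split an id into its parts, rejecting anything that is malformed."""
    match = _CELL_ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"expected GRID_ID.COLUMN.ROW, got {text!r}")
    grid_id, column, row = match.groups()
    return CellId(grid_id, int(column), int(row))
```

Parsing and formatting round-trip, so the same id can be used in URLs, the database and user feedback without any lookup table.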

Pros

scheme is very simple and easy to understand
vertical and horizontal adjacent grid cells have similar ids
- the adjacent cell id in every direction can be guessed
scheme can be extended to support multiple grid sizes
an approximate lat/long can be derived from row/col values
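
As a sketch of the "adjacent ids can be guessed" point: given an id and the grid dimensions, the neighbouring ids follow from simple arithmetic. The function name `neighbour_ids` is hypothetical, and the grid dimensions are passed in by the caller rather than assumed.

```python
def neighbour_ids(cell_id: str, n_columns: int, n_rows: int) -> list[str]:
    """Return the ids of the up-to-eight cells surrounding cell_id."""
    grid_id, column_text, row_text = cell_id.split(".")
    column, row = int(column_text), int(row_text)
    neighbours = []
    for d_column in (-1, 0, 1):
        for d_row in (-1, 0, 1):
            if d_column == 0 and d_row == 0:
                continue  # skip the cell itself
            c, r = column + d_column, row + d_row
            # Drop neighbours that would fall outside the grid.
            if 0 <= c < n_columns and 0 <= r < n_rows:
                neighbours.append(f"{grid_id}.{c}.{r}")
    return neighbours
```

A corner cell such as 10.0.0 has three neighbours (10.0.1, 10.1.0 and 10.1.1), while an interior cell has eight.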

Cons

for the 1km grid, ids are quite long, e.g. 1.6321.4485 is probably longer than desired
custom scheme is not globally addressable

Alternatives

Serial ids

In theory, based on the domain, each grid will have a fixed number of cells:

10km has approx 283768 cells
1km has approx 28357129 cells

We can simply use a serial id for each cell based on the order of generation.

Pros

simple scheme
adjacent cells have similar ids, but only in 1 dimension

Cons

lack of any semantic meaning
lookup by index only
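
For comparison, a minimal sketch of how a serial id relates to column/row, assuming cells are generated in row-major order (an assumption, since the order of generation is not specified here). The quoted 10km count is consistent with 632 columns by 449 rows (632 × 449 = 283768), which is what the example in the comments uses.

```python
def serial_from_cell(column: int, row: int, n_columns: int) -> int:
    """Row-major serial id: cells are numbered along each row in turn."""
    return row * n_columns + column


def cell_from_serial(serial: int, n_columns: int) -> tuple[int, int]:
    """Invert serial_from_cell, returning (column, row)."""
    row, column = divmod(serial, n_columns)
    return column, row


# With 632 columns, horizontal neighbours differ by 1 but vertical
# neighbours differ by 632, so ids are only "similar" in one dimension.
```

Note that decoding a serial id requires knowing the grid width, whereas the column/row scheme carries both indices in the id itself.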

Random ids or uuids

Although this is standard practice in many databases, it has very few advantages in this use case. We do not care if ids can be guessed, as grid cells are not sensitive data.

This approach has all the disadvantages of serial ids and more. For random values to work without many collisions, the id space must be much larger than the number of cells, leading to longer-than-necessary ids.
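
To put a rough number on that cost, the sketch below uses the standard birthday-bound approximation p ≈ 1 − exp(−n(n−1) / 2N) for n random ids drawn from a space of size N. The function names, the 32-character alphabet and the 1% threshold are assumptions chosen purely for illustration.

```python
import math


def collision_probability(n_ids: int, id_space: int) -> float:
    """Approximate chance that at least two of n_ids random ids collide."""
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * id_space))


def min_base32_length(n_ids: int, max_probability: float) -> int:
    """Shortest random base32 id length that keeps collisions below the limit."""
    length = 1
    while collision_probability(n_ids, 32 ** length) > max_probability:
        length += 1
    return length
```

Under these assumptions, the ~283768 cells of the 10km grid need 9 random base32 characters to keep the chance of any collision under 1%, while a serial id for the same grid fits in 4.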

Encoding `CELL_COLUMN` and `CELL_ROW`

To further shorten the ids and make them more memorable, we could encode the cell column and row using an encoding scheme that includes alpha characters.

base64
- efficient and, in its base64url variant, URL-safe
- implementations available in every language
- Cons:
  - can lead to unintended words
  - similarity of 0 and O and other characters can be problematic
base32 schemes such as:
- Geohash
- Word-safe alphabet
  - 10.0.0 becomes 10.2.2 (or 10.22.22 padded)
  - 10.632.449 becomes 10.jX.3P
  - 1.6321.4485 becomes 1.V78.8J6 (nice and short!)
- Cons:
  - Less efficient than base64, but unlikely to matter much at the sizes we need
  - Non-standard, so custom implementation is necessary for encoding/decoding
  - Starts at 2 :joy:
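
A minimal sketch of the word-safe option, to show how small the custom implementation would be. It assumes the word-safe Base32 alphabet `23456789CFGHJMPQRVWXcfghjmpqrvwx` and writes the least significant digit first; both are assumptions, inferred from the fact that they reproduce the `jX`, `3P` and `V78` components in the examples above. The function names are hypothetical.

```python
# Assumed alphabet: digits and letters chosen to avoid forming words.
WORD_SAFE_ALPHABET = "23456789CFGHJMPQRVWXcfghjmpqrvwx"


def encode_index(value: int) -> str:
    """Encode a non-negative index, least significant digit first."""
    if value < 0:
        raise ValueError("index must be non-negative")
    digits = []
    while True:
        value, remainder = divmod(value, 32)
        digits.append(WORD_SAFE_ALPHABET[remainder])
        if value == 0:
            break
    return "".join(digits)


def decode_index(text: str) -> int:
    """Invert encode_index; raises ValueError on unknown characters."""
    if not text:
        raise ValueError("empty index")
    value = 0
    for char in reversed(text):
        value = value * 32 + WORD_SAFE_ALPHABET.index(char)
    return value
```

Because the alphabet is case-sensitive, ids encoded this way would need to be treated as case-sensitive in URLs and database lookups.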

Additional context

@prayner @lewisjared if this feature is desirable in the prior, I'll leave it up to you to decide where it might be implemented, although I'm happy to attempt a PR if desired.

It would also probably be good to adapt omGeoJSON.py or create a new script with the sole intention of outputting the grid itself, without requiring all the inputs and processing to generate the prior. Although we'll only need to do it once, I'd like to have a reproducible way to generate a grid that's consumable by the front-end, as we may need to do this again when we introduce new grid resolutions in the future. Happy for this to be a separate issue/PR however.

I'm also aware there are schemes for hierarchical grids, but this didn't seem to fit our use case as we aren't likely to use hierarchical representations in the model.

For example, we know 10.0.0 contains 1.0.0 through 1.9.9, but this is a happy coincidence due to the grid sizes we've chosen. The same relationship doesn't occur between a 25km grid and a 10km grid.

We are, in fact, pretty likely to use hierarchical grids when we get around to nested runs. Worse yet, the nested grids might not be in "nice" numbers, since WRF demands integer ratios to the parent grid but nothing more. So our next nested grid may have a ratio of 3 (so 10/3 km) or of 5 (2km). fwiw, WRF does this by coding hierarchical grids relative to their parent, listing the ratio in each direction and the starting indices within the parent. Somewhere in the grid description we should also keep the CRS so we can completely define the grid for external use, but this shouldn't be in the encoding, just in the grid metadata somewhere.

Personally I'd push for the i,j scheme, with the question of identifying the grid itself to be refined. I think it should be grid-id,i,j, but I don't think the grid-id can refer just to resolution. Over lunch we were discussing what happens when we're asked to redeploy Open Methane over North Asia, for example. I think all the needed attributes for a grid description are found in the first element of the observation files, which describes the domain.
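
To illustrate the parent-relative description mentioned above, here is a simplified sketch of mapping between nested and parent cell indices from a ratio and starting indices. The class and function names are hypothetical, indices here are zero-based (WRF's own namelist values such as `i_parent_start` are, as far as I know, one-based), and details like grid staggering are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NestedGrid:
    ratio: int           # nested cells per parent cell in each direction
    i_parent_start: int  # zero-based parent column where the nest begins
    j_parent_start: int  # zero-based parent row where the nest begins


def parent_cell(nest: NestedGrid, i: int, j: int) -> tuple[int, int]:
    """Return the parent (i, j) that contains nested cell (i, j)."""
    return (nest.i_parent_start + i // nest.ratio,
            nest.j_parent_start + j // nest.ratio)


def child_cells(nest: NestedGrid, parent_i: int, parent_j: int) -> list[tuple[int, int]]:
    """Return the nested cells covered by one parent cell (no bounds checks)."""
    i0 = (parent_i - nest.i_parent_start) * nest.ratio
    j0 = (parent_j - nest.j_parent_start) * nest.ratio
    return [(i0 + di, j0 + dj)
            for dj in range(nest.ratio)
            for di in range(nest.ratio)]
```

With this kind of description the i,j part of an id stays a plain index within its own grid, and the relationship to the parent lives in the grid metadata rather than in the id.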

Hey @prayner, thanks for these comments, they're quite illuminating.

Following on from our earlier conversation, there's a desire to keep Open Methane "domain agnostic" in how it's engineered. Although our domain is very well defined (Australia), it would be beneficial for the web project not to use that knowledge implicitly.

With that in mind, I agree that it would be useful to have part of the openmethane-prior (or setup-wrf) output details about the domain and grid(s) in a way that can be ingested into other systems. I can use that to populate database tables like grid and grid_cell which would form the basis of the web UI and API without hardcoding any implicit knowledge of the domain.

Grid ids can reference a grid.id or similar as the first part, and still use a relative coord system as described above for the i/j values.

> The prior is responsible for generating the grid itself

Kind of. Technically setup_wrf is responsible for defining the grid, but I agree that the prior is a better place for this type of calculation.

I agree that we need to be agnostic to the current grid so that we can run nested domains/other domains in future. There will be a bunch of other Grid-specific information (CRS, i/j sizes, resolution), so we can write a JSON file containing that information.
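
As a starting point for discussion, a minimal sketch of what writing that JSON file might look like. Every field name here (`region`, `grid_id`, `crs`, `n_columns`, `n_rows`, `cell_size_m`) is hypothetical, as are the class and function names; the real set of attributes would come from the domain description.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GridDescription:
    region: str         # hypothetical: which deployment/region the grid belongs to
    grid_id: str        # hypothetical: first part of each cell id
    crs: str            # CRS definition, kept in metadata rather than in the id
    n_columns: int      # i size
    n_rows: int         # j size
    cell_size_m: float  # grid resolution in metres


def grid_to_json(grid: GridDescription) -> str:
    """Serialise the grid description for ingestion by other systems."""
    return json.dumps(asdict(grid), indent=2, sort_keys=True)


def grid_from_json(text: str) -> GridDescription:
    """Rebuild the description, failing on missing or unexpected fields."""
    return GridDescription(**json.loads(text))
```

The web project could then populate its grid and grid_cell tables from this file alone, without hardcoding any knowledge of the domain.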

We should also consider whether we want to treat each grid completely separately. Do we want to differentiate a North Asian domain from a nested domain in the Australian context? Throughout the processing, they are (probably) treated independently, but there is a hierarchy that we might want to preserve. I think we need terminology to differentiate a parent domain from a nested domain. WRF labels the parent domain d01 and nested domains d02, d03, ...

The nested domains don't necessarily all overlap. I don't suggest we have to handle that level of complexity, but perhaps we have Region and Grid (or maybe Domain) terminology. I think that maps closer to how things are modelled.

The calculation side is somewhat agnostic to the prefix of these grid ids. GRID_ID could be the combination of the region and grid, or the id could carry an extra component.
