openmethane / openmethane-prior

Method to calculate a gridded, prior emissions estimate for methane across Australia
Apache License 2.0

Add a unique identifier for each grid cell (geocoding) #9

Open aethr opened 4 months ago

aethr commented 4 months ago

The motivation

A primary feature of the web interface is to be able to reference and visit individual grid cells. By creating a unique identifier via a geocoding scheme for each grid cell, humans will be able to associate an identifier with an important location.

Examples

Grid cell tiles

When viewing a collection of grid cells, each has an identifier (N63.326 in this example):

image

Viewing a grid cell on the map

When viewing the map, individual grid cells can be selected. The cell identifier is displayed near the top of the left-hand panel (V32.620 in this example), and we can imagine the URL of the view will include the same identifier, like openmethane.org/map/X-123-456:

image

Why generate in the prior?

Q: Since the cell id is a user interface tool, why should we generate it in the prior? It could be generated by the web interface when the grid data is ingested.

A: The prior is responsible for generating the grid itself, so if each cell needs a unique id, this seems like the logical place to generate it. It will also help us correlate user feedback from the web interface, which will use visible identifiers to denote grid cells, with locations in the model, without referencing the database of the web project.

Identifier requirements

The system to generate identifiers should make ids that are:

The proposed solution

I would propose a 3 part scheme: {GRID_ID}.{CELL_COLUMN}.{CELL_ROW}
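A minimal sketch of how such an identifier could be built and parsed, assuming zero-based integer column and row indices and a grid id that contains no "." character. The names `CellId` and `parse_cell_id` are hypothetical and not part of openmethane-prior:

```python
from typing import NamedTuple


class CellId(NamedTuple):
    """Hypothetical container for the three parts of a cell identifier."""

    grid_id: str  # assumed to be a short label with no "." in it
    column: int   # CELL_COLUMN, assumed zero-based
    row: int      # CELL_ROW, assumed zero-based

    def __str__(self) -> str:
        # Render as {GRID_ID}.{CELL_COLUMN}.{CELL_ROW}
        return f"{self.grid_id}.{self.column}.{self.row}"


def parse_cell_id(text: str) -> CellId:
    """Split an identifier back into its parts.

    Raises ValueError if the text does not have exactly three parts
    or if the column/row are not integers.
    """
    grid_id, column, row = text.split(".")
    return CellId(grid_id, int(column), int(row))
```

For example, `str(CellId("10", 3, 7))` gives `"10.3.7"`, and `parse_cell_id("10.3.7")` returns the same three parts.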

Pros

Cons

Alternatives

Serial ids

In theory, based on the domain, each grid will have a fixed number of cells:

We can simply use a serial id for each cell based on the order of generation.
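As an illustration only, assuming cells are generated in row-major order with zero-based indices (the actual generation order is not stated here), a serial id and its inverse could look like this. The function names are hypothetical:

```python
def serial_id(column: int, row: int, n_columns: int) -> int:
    """Serial id of a cell, assuming row-major generation order."""
    return row * n_columns + column


def column_row(serial: int, n_columns: int) -> tuple[int, int]:
    """Recover (column, row) from a serial id.

    The grid width must be known to decode, so the id alone
    says nothing about where the cell sits in the grid.
    """
    row, column = divmod(serial, n_columns)
    return column, row
```

With a grid 10 columns wide, the cell at column 3, row 2 gets serial id 23, and `column_row(23, 10)` returns `(3, 2)`.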

Pros

Cons

Random ids or uuids

Although this is standard practice in many databases, it has very few advantages in this use case. We do not care if ids can be guessed, as grid cells are not sensitive data.

This approach has all the disadvantages of serial ids and more. For random values to work without many collisions, the space of possible ids must be much larger than the number of cells, leading to larger-than-necessary ids.

Encoding CELL_COLUMN and CELL_ROW

To further shorten ids and make them more memorable, we could encode the cell column and row using an encoding scheme that includes alpha characters.
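One possible encoding, shown purely as a sketch, is base 36 (digits 0-9 followed by letters A-Z). The choice of base 36 is an assumption for illustration; the issue does not settle on a particular alphabet, and `encode_index` and `decode_index` are hypothetical names:

```python
import string

# 36 symbols: "0".."9" then "A".."Z"
ALPHABET = string.digits + string.ascii_uppercase


def encode_index(value: int) -> str:
    """Encode a non-negative column or row index in base 36."""
    if value < 0:
        raise ValueError("index must be non-negative")
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    # Digits were collected least-significant first.
    return "".join(reversed(digits))


def decode_index(text: str) -> int:
    """Decode a base-36 string back to an integer index."""
    return int(text, 36)
```

Under this scheme an index up to 1295 fits in two characters; for instance, 446 encodes as `CE`. A real scheme might also drop easily confused characters such as `O` and `0`.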

Additional context

@prayner @lewisjared if this feature is desirable in the prior, I'll leave it up to you to decide where it might be implemented, although I'm happy to attempt a PR if desired.

It would also probably be good to adapt omGeoJSON.py or create a new script with the sole intention of outputting the grid itself, without requiring all the inputs and processing to generate the prior. Although we'll only need to do it once, I'd like to have a reproducible way to generate a grid that's consumable by the front-end, as we may need to do this again when we introduce new grid resolutions in the future. Happy for this to be a separate issue/PR, however.

aethr commented 4 months ago

I'm also aware there are schemes for hierarchical grids, but these didn't seem to fit our use case, as we aren't likely to use hierarchical representations in the model.

For example, we know 10.0.0 contains 1.0.0 through 1.10.10, but this is a happy coincidence due to the grid sizes we've chosen. The same relationship doesn't occur between a 25km grid and a 10km grid.

prayner commented 4 months ago

We are, in fact, pretty likely to use hierarchical grids when we get around to nested runs. Worse yet, the nested grids might not be in "nice" numbers, since WRF demands integer ratios to the parent grid but nothing more. So our next nested grid may have a ratio of 3 (10/3 km) or of 5 (2 km). FWIW, WRF does this by coding hierarchical grids relative to their parent, listing the ratio in each direction and the starting indices within the parent. Somewhere in the grid description we should also keep the CRS so we can completely define the grid for external use, but this shouldn't be in the encoding, just in the grid metadata somewhere. Personally I'd push for the i,j scheme, with the question of identifying the grid itself to be refined. I think it should be grid-id,i,j, but I don't think the grid-id can refer just to resolution. Over lunch we were discussing what happens when we're asked to redeploy Open Methane over North Asia, for example. I think all the needed attributes for a grid description are found in the first element of the observation files, which describes the domain.
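To make the parent-relative description concrete, here is a rough sketch of mapping a nested cell back to the parent cell that contains it, given a ratio in each direction and the starting indices within the parent. It uses zero-based indices for simplicity, which is an assumption of this sketch rather than WRF's own convention, and the class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NestedGrid:
    """Hypothetical description of a nested grid relative to its parent."""

    ratio_i: int         # nested cells per parent cell along i
    ratio_j: int         # nested cells per parent cell along j
    i_parent_start: int  # parent i index where the nest begins (zero-based here)
    j_parent_start: int  # parent j index where the nest begins (zero-based here)

    def parent_cell(self, i: int, j: int) -> tuple[int, int]:
        """Return the (i, j) of the parent cell containing nested cell (i, j)."""
        return (
            self.i_parent_start + i // self.ratio_i,
            self.j_parent_start + j // self.ratio_j,
        )
```

With a ratio of 3 in both directions and a nest starting at parent cell (10, 20), nested cell (7, 2) falls inside parent cell (12, 20).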

aethr commented 4 months ago

Hey @prayner, thanks for these comments, they're quite illuminating.

Following on from our earlier conversation, there's a desire to keep Open Methane "domain agnostic" in how it's engineered. Although our domain is very well defined (Australia), it would be beneficial for the web project not to use that knowledge implicitly.

With that in mind, I agree that it would be useful to have part of the openmethane-prior (or setup-wrf) output details about the domain and grid(s) in a way that can be ingested into other systems. I can use that to populate database tables like grid and grid_cell which would form the basis of the web UI and API without hardcoding any implicit knowledge of the domain.

Cell ids can reference a grid.id or similar as the first part, and still use a relative coordinate system as described above for the i/j values.

lewisjared commented 4 months ago

The prior is responsible for generating the grid itself

Kind of. Technically setup_wrf is responsible for defining the grid, but I agree that the prior is a better place for this type of calculation.

I agree that we need to be agnostic to the current grid so that we can run nested domains/other domains in future. There will be a bunch of other Grid-specific information (CRS, i/j sizes, resolution), so we can write a JSON file containing that information.
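A sketch of what writing that information out might look like, using only the standard library. The field names and the example values (including the CRS string) are placeholders chosen for illustration, not the project's actual schema or grid:

```python
import json


def grid_metadata(
    grid_id: str,
    crs: str,
    n_columns: int,
    n_rows: int,
    resolution_m: float,
) -> dict:
    """Collect grid-level information in a JSON-serialisable dict.

    All keys are hypothetical; a real schema would be agreed separately.
    """
    return {
        "grid_id": grid_id,
        "crs": crs,
        "n_columns": n_columns,      # size along i
        "n_rows": n_rows,            # size along j
        "resolution_m": resolution_m,
    }


# Placeholder values only, not a real Open Methane grid.
text = json.dumps(grid_metadata("example", "EPSG:4326", 4, 3, 10000.0), indent=2)
```

The resulting `text` could be written to a file next to the prior output and ingested by other systems without any hardcoded knowledge of the domain.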

We should also consider whether we want to treat each grid completely separately. Do we want to differentiate a North Asian domain from a nested domain in the Australian context? Throughout the processing, they are (probably) treated independently, but there is a hierarchy that we might want to preserve. I think we need terminology to differentiate a parent domain from a nested domain. WRF labels domains d0 for the parent domain and d1, d2, ... for nested domains.

The nested domains don't necessarily all overlap. I don't suggest we have to handle that level of complexity, but perhaps we have Region and Grid (or maybe Domain) terminology. I think that maps closer to how things are modelled.

lewisjared commented 4 months ago

The calculation side is somewhat agnostic to the prefix of these grid ids. GRID_ID could be the combination of the region and grid, or there could be an extra component in the id.