Indexing: Problems with formula used to compute average grid cell area

rod-glover commented 7 years ago

Context: Converting the R script db/index_netcdf.r to Python: mm_cataloguer/index_netcdf.py. Determining the mean area of a grid cell in a grid defined by vectors of latitude and longitude values (to fill database field Grid.cell_avg_area_sq_km).

Current situation: The formula used in the R script to approximate the area of a grid cell uses a conic approximation for the area of a patch of a sphere bounded by two lines of latitude and two lines of longitude. Its formula is:

A = abs(lon1 - lon2) * abs(lat1 - lat2) * cos(lat1) * R**2

where lat1, lat2, lon1, lon2 are the bounding values in radians, and R is the radius of the earth in km.

To compute an approximation to the mean grid cell area, the following things are assumed:

the entire grid (over which we average) consists of all cells defined by the lat and lon values (this turns out to be important for simplifying this computation)
longitude (lon) is equally spaced
latitude (lat) is stored in ascending order of value by index

So to compute the mean of the approximation above over all cells:

Amean = abs(lon[1] - lon[0]) * mean(diff(lat) * cos(lat[:-1]) * R**2

where abs, diff, and cos apply pointwise to a list, and mean takes the mean of a list (think numpy).

Note: The assumption that lon is equally spaced significantly simplifies the computation; it is O(N) where N is the number of latitude values.

Problem 1: The approximation is accurate only for small differences in latitude, and decreases in accuracy as the latitude approaches a pole. Some grids we encounter at PCIC are coarse (large latitude differences) and approach the poles. This formula will be inaccurate in those cases.

Problem 2: It is unnecessarily complex (in both the mathematical and the computational complexity senses) if you know the exact formula for the area of a patch of a sphere bounded by pairs of lat and lon lines.

Potential response A: The exact formula, valid for all values, for the area of a patch bounded by lat1, lat2, lon1, lon2 is:

A = abs(lon1 - lon2) * abs(sin(lat1) - sin(lat2)) * R**2

The mean area of a grid cell is just the sum of the areas of all grid cells divided by the number of grid cells. The sum of the area of all grid cells is just the area of the patch bounded by the minimum and maximum lat and lon values. Therefore the mean area of a grid cell is:

Amean = (lon[-1] - lon[0]) * (sin(lat[-1]) - sin(lat[0])) * R**2 / (len(lat) * len(lon))

plus some carefulness about wrapping of longitude values.

This is much simpler in form and much less computationally costly than the mean of approximations computation above. Nor does it depend on assumption 2 above (equispaced longitudes).

Potential response B: What is the average cell area used for? Does it actually matter? Is it worth expending the computational effort?

jameshiebert commented 7 years ago

A few brief responses:

Cell area, in general, is important and has applications. They may not specifically be used in our software much right now, but there are common use cases. The primary one is using area-weighted averages when computing climatological statistics over some irregular polygon (that intersects fractions of grid cells). You could probably discuss other use cases with Trevor.
Equally spaced longitude is probably a very safe assumption. But it's also cheap to assert this assumption (and raise an Exception if it's not).
O(n) where n is the number of lats, is probably plenty fast for this code.
You mention the need to watch out for date-line overlap. Also, make sure to consider the GCM-case where you're covering the entire globe (at least in longitude).

rod-glover commented 7 years ago

I'm going to put the basic question up front, then some responses to your points.

Question: Given that the average cell area is not used to find Grid entries in the database and can't be used to compute area-weighted averages (see 1. below), shall we adopt the new computation? Or would you prefer to stick with the existing one?

Responses:

Understood.
- However, this indexing code computes only average cell area and stores it in the modelmeta database. It's not clear what it is actually used for.
- It is not used to find a specific Grid in the database.
- In general it can't be used to compute area-weighted averages, because it is already averaged (it's the area-weighted average of 1.) I am guessing it is a display value used to help users select the data they want to view/download. Stephen Sobie (Trevor is not available today) concurs with this analysis.
- The point of my rather long-winded discussion was that it is unnecessary and sometimes inaccurate to use the approximation formula to compute individual cell areas and then average them all. Instead you can just compute the area of the entire grid and divide by the number of cells.
Good idea, if we keep using the current computation for average cell area. Unnecessary if we adopt the new one.
Yes. The question then turns on the accuracy of the computation. The alternative computation is perfectly accurate for all cases.
Good point.

pacificclimate / modelmeta

Indexing: Problems with formula used to compute average grid cell area #4