toddkarin / pvcz

Photovoltaic Climate Zones and Stressors
MIT License
7 stars 3 forks source link

Using float16 lattice coordinates causes incorrect closest lattice point identification #2

Open kandersolar opened 2 years ago

kandersolar commented 2 years ago

The README has this snippet for retrieving the stressor data corresponding to the lattice point closest to the given location of interest:

import pvcz

# Note df is a flattened list of lat/lon values that only includes those over land
df = pvcz.get_pvcz_data()

# Point of interest specified by lat/lon coordinates.
lat_poi = 32
lon_poi = -95.23

# Find the closest location on land to a point of interest
closest_index = pvcz.arg_closest_point(lat_poi, lon_poi, df['lat'], df['lon'])

# Get the stressor data from this location
location_data = df.iloc[closest_index]
print(location_data)

However, the closest_index does not actually correspond to the true closest index. In this case it returns the location lat, lon = 30.875, -96.375. The lattice has 0.25 by 0.25 degree spacing, so this point (which differs by more than a degree in both dimensions) cannot be the closest.

The problem seems to be that the latitude and longitude values returned by get_pvcz_data() have dtype=np.float16 and are insufficiently precise to reliably calculate distance to the point of interest. This is essentially what is happening under the hood:

In [14]: pvcz.haversine_distance(np.array([np.float16(30.875)]), np.array([np.float16(-96.375)]), 32.0, -95.23)
Out[14]: array([0.], dtype=float32)

Because of the precision loss, many nearby lattice points are incorrectly calculated to have zero distance to the location of interest, and only one of them (the first?) is returned. In some informal testing it seems that this might cause the returned point to be some 150 km away from the true nearest point. It's pretty easy to get the correct behavior just by converting the float16s to float64s before calculating any distances, e.g.:

pvcz.arg_closest_point(lat_poi, lon_poi, df['lat'].astype(np.float64), df['lon'].astype(np.float64))

which in this case corresponds to lat, lon = 32.125, -95.125 as expected.

The significance here of course is that this can result in the wrong climate zone being returned for locations near temperature or humidity zone boundaries. Here's a quick look at the effect; red dots are locations where the true pvcz_labeled string is different from the one returned using the README's procedure (blue means there is no difference), and you can see how the red areas correspond to temperature and humidity boundaries.

image

I suspect the best fix would be to modify get_pvcz_data() to convert the coordinate fields (and perhaps any other float16 fields) to float64 before returning. A check in arg_closest_point for multiple points with distance=0 may be worthwhile as well.

kandersolar commented 2 years ago

cc @toddkarin and @computron since it seems like there might be nobody "watching" this repository :)