pnwairfire / fccsmap

package supports the look-up of FCCS fuelbed information by lat/lng or vector geo spatial data.
GNU General Public License v3.0
1 stars 1 forks source link

`_create_geo_data_df` doesn't actually reproject inputted GeoJSON #25

Open natjms opened 1 week ago

natjms commented 1 week ago

Some useful context to this issue as I'm going to describe it is that the fccs_canada.nc data set is very broken. The short of it is that while it does indeed have a geotransform described in its metadata, for one reason or another, GDAL won't recognize it. You can force it to recognize it by essentially reading the affine transformation matrix in some other program and then writing the dataset to a new file, passing the geotransform manually:

import rioxarray
import rasterio
from fccsmap.lookup import FccsLookUp
old_ds_filename = FccsLookUp(is_canada=True)._filename
old_ds = rasterio.open(old_ds_filename)
new_ds = rasterio.open('./modified_fccs_canada.nc', 'w', width=old_ds.width, height=old_ds.height, crs=old_ds.crs, count=old_ds.count, nodata=0, dtype='float32', transform=rasterio.transform.Affine(1000,0,-2341249.999999993,0,-1000,3833423.97736665))
raster = rioxarray.open_rasterio(old_ds)
new_ds.write(raster)
old_ds.close()
new_ds.close()

Not that that really matters, because that transform maps points about 30 degrees east of where they should be. I decided to georeference the dataset from scratch to deal with this. If you'd like to include my rework of the fccs_canada.nc dataset, you're welcome to download it here. But if you were to load the dataset into fccsmap manually, you'd have this problem:

>>> from fccsmap.lookup import FccsLookUp
>>> l = FccsLookUp(fccs_fuelload_file='./better_fccs_canada.nc')
>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[51, -116]]})
Traceback (most recent call last):
  [...]
OverflowError: cannot convert float infinity to integer

I've come to assume that this error occurs when a provided point lands outside the domain. That seems to reliably be the case for fccs_fuelload.nc, anyway. If you flip around the coordinates, as you'd need to do for the LCC projection this data set uses, you get this:

>>> l.look_up({'type': 'MultiPoint', 'coordinates':[[-116, 51]]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 135, in look_up
    stats = self._look_up(new_geo_data)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/lookup.py", line 94, in _look_up
    return self._look_up_in_file(geo_data_df, self._filename)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 22, in _
    r = func(*args,  **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fccsmap/baselookup.py", line 242, in _look_up_in_file
    stats = zonal_stats(geo_data_df, filename,
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 36, in zonal_stats
    return list(gen_zonal_stats(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/main.py", line 171, in gen_zonal_stats
    fsrc = rast.read(bounds=geom_bounds, boundless=boundless)
  File "/usr/local/lib/python3.10/dist-packages/rasterstats/io.py", line 349, in read
    new_array = self.src.read(
  File "rasterio/_io.pyx", line 627, in rasterio._io.DatasetReaderBase.read
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 633. GiB for an array with shape (1, 412166, 412120) and data type float32

This error looks like it's a result of projection issues. Specifically, you can get these ridiculous attempted allocations when two data sets use different projections, and rasterio tries to allocate enough space to cover both of them. That takes us to this method in fccsmap/baselookup.py:

def _create_geo_data_df(self, geo_data):
    logging.debug("Creating data frame of geo-data")
    shape = shapely.geometry.shape(geo_data)
    wgs84_df = geopandas.GeoDataFrame({'geometry': [shape]}, crs="EPSG:4326")
    return wgs84_df.to_crs(self._crs)

This is where I started thinking this might actually be an issue with fccsmap and not just me trying to cram a square into a circular hole. This method assumes the user provides input using the CRS EPSG:4326, and then attempts to translate it to the particular CRS of the dataset. The crux of this issue seems to be that GeoDataFrame.to_crs does not actually reproject the data frame. I'm guessing it does in a sense, but not in a way recognized by rasterio and GDAL when running zonal_stats. If I'm understanding this correctly, it makes sense that this wouldn't be a problem for fccs_fuelload.nc because it uses the same projection, or at least one that's very similar(?)

GeoPandas has a page on how to reproject a GeoDataFrame using GDAL which would be slower, but you'd think it'd do it in a way that GDAL would subsequently recognize:

def _create_geo_data_df(self, geo_data):
    logging.debug("Creating data frame of geo-data")
   shape = shapely.geometry.shape(geo_data)

    # Create the geo dataframe, initially using EPSG:4326 coordinates
    wgs84_df = geopandas.GeoDataFrame({'geometry': [shape]}, crs="EPSG:4326")

    transformed_geometry = rasterio.warp.transform_geom(
        src_crs=wgs84_df.crs,
        dst_crs=self._crs.to_string(),
        geom=wgs84_df.geometry.values
    )

    # Reproject the dataframe using the coordinate reference system matching
    # that of the selected fuel bed
    reprojected_df = wgs84_df.set_geometry(
        [shapely.geometry.shape(geom) for geom in transformed_geometry],
        crs=self._crs.to_string()
    )

    return reprojected_df

I wasn't able to get this working, unfortunately. Running this modified method throws "invalid latitude" whenever you run look_up, on both fccs_fuelload.nc and better_fccs_canada.nc.

All that being said, if fccsmap is in fact trying to accept coordinates in EPSG:4326 and then map it to the corresponding data set crs, it's not clear why I had to flip the coordinates around earlier. I'd expect it to have failed in the same way without me having to flip them around. I don't have a good answer for why that is.

I was hoping to find a solution that works for all the data sets included in fccsmap, but so far it's eluded me. I have a working patch that lets us access fccs_canada.nc through fccsmap but basically breaks it for the others. We might use that internally at the WFRT, but at this rate it looks like we're going to to take a different path entirely to do fuel bed lookup in Canada. If I do have the time to revisit this and come up with an elegant solution, I'll submit it as a PR.

I don't really expect AirFire to provide support for the Canadian aspect of fccsmap, but since this incredibly long journey lead me to something that could hypothetically be an issue in the future if you folks introduce more data sets, I thought I'd take the time to document it here. Not to mention, I didn't want to leave this investigation as a complete red herring :)

jdubowy commented 1 week ago

Thanks for looking into and documenting this. We (AirFire) don't use fccs_canada.nc (it was added by Miles a couple of years ago for use by UBC), so we're probably not going to spend time in the near future specifically getting it working. However, we should take a look at the potential issue in our use of GeoDataFrame.to_crs. Thanks for pointing that out.

natjms commented 1 week ago

Recently I've started working on integrating another dataset into our fork of fccsmap (also LCC) and was surprised to see that it actually worked flawlessly first try with no modifications. At this rate it seems to me that it's more likely this isn't a problem with geopandas and rasterio, and is in fact another hidden issue with the fccs_canada.nc data set