martibosch / pylandstats

Computing landscape metrics in the Python ecosystem
https://doi.org/10.1371/journal.pone.0225734
GNU General Public License v3.0
82 stars 16 forks source link

Using a shapefile for the zones in ZonalAnalysis #7

Closed emuise closed 3 years ago

emuise commented 3 years ago

Description

Is it possible to implement using a shapefile for the mask_arr argument in pls.ZonalAnalysis? I don't need to buffer the zones, just generate the stats within each zone. From what I currently gather mask_arr takes rectangles in the form of numpy arrays.

martibosch commented 3 years ago

Hello @emuise, this is actually a great idea, thanks for bringing it up. It should actually be quite easty to implement with rasterio and geopandas, so I will try to do it over the weekend and release a new version including this. I will keep you posted.

Best, Martí

martibosch commented 3 years ago

Hello @emuise,

I am very sorry for my delay, but these last PhD days are crazy and also I wanted to make some other changes before releasing the new minor. You can now (with PyLandStats 2.2.0) use a shapefile, geodataframe, geoseries or any file compatible with geopandas in your ZonalAnalysis by means of the masks argument (note that masks_arr is now deprecated). The usage is quite straightforward, you can see an example in the 03-zonal-analysis.ipynb notebook, section "Using vector geometries to define zones" (cells 13 and 14).

I hope that this works for you. Feel free to reopen if it doesn't. Thanks again for your suggestion and for using PyLandStats. Best, Martí

emuise commented 3 years ago

Hi Martí, Thanks for doing this. I have ran it on some of my data and am encountering some issues.

I can't use FID for masks_index_col (this I can easily work around on my own, just want to let you know).

When I run .compute_class_metrics_df(metrics = ['proportion_of_landscape']), all of the values are 0 (see attached picture, this applies across all FID_Terr_B, and class_val). pyland

Additionally, I receive this performance warning when running compute_class_metrics_df pyland2

Any ideas what could be causing any of this? Let me know if you need more info from me.

I have double checked that I'm not loading in blank rasters/shps. pyland3

Also, I just want to say I appreciate you adding this feature a lot. Best, Evan

martibosch commented 3 years ago

Hello again @emuise,

the fact that the proportion of landscape is 0 is because there are no patches of that given class in any of your zones. You can pass fillna=False to the compute_class_metrics_df method, and see if the resulting cells are NaN instead.

Your warning seems to be related to the index of the data frame (i.e., a multi index with the class_val and FID_Terr_B). Also, why can't you use 'FID' for masks_index_col? what are the data types of each column of your data frame?

Would it be possible to send me your data so I can try to see what is going on? You can contact me by mail at "marti (dot) bosch (at) epfl (dot) ch" (no spaces, of course). There might be potential PyLandStats bugs so this could be a great opportunity to fix them. But I understand if you cannot share such data.

Best, Martí

emuise commented 3 years ago

Hi @martibosch, I sent you an email with a link to the data on google drive.

Using fillna = False leads to all values being NaN. Could this be something to do with the crs / them not overlapping somehow? The rasters are generated using the shapefiles, so this shouldn't be the case.

The data type for FID is Object ID, from Arc. I can just copy this into another column and use that, easy solve on my end.

Evan

martibosch commented 3 years ago

Hello again @emuise,

This is going to be a long answer, which I will divide into two parts:

1. CRS equality checks

I wonder whether your NaN values are an issue of how the CRS is encoded in the raster and in the shapefile. Pyproj 2.6.1 reads the former as:

<Projected CRS: EPSG:3005>
Name: NAD_1983_BC_Environment_Albers
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

and the latter as:

<Projected CRS: EPSG:3005>
Name: NAD83 / BC Albers
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Canada - British Columbia
- bounds: (-139.04, 48.25, -114.08, 60.01)
Coordinate Operation:
- name: British Columbia Albers
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

hence pyproj.crs.crs.CRS.is_exact_same returns False, although a simple equality comparison (i.e., __eq__ or ==) returns True. This may be the reason why pylandstats does not clip the raster to the shapefile zones properly. Does the following code (replace raster_filepath and zones_filepath with the path to your landscape raster and zone shapefile respectively) return a plot with a

pls.ZonalAnalysis(raster_filepath, masks=zones_filepath).landscapes[0].plot_landscape()

2. On masks_col_index

The following code works in my environment:

import pylandstats as pls

za = pls.ZonalAnalysis(raster_filepath, masks=zones_filepath)
pland_df = za.compute_class_metrics_df(['proportion_of_landscape'], [20])

and pland_df is of the form:

metric                      proportion_of_landscape
class_val attribute_values                         
20        0                                0.353737
          1                                0.000000
          2                                0.465496
          3                                0.538368
          4                                0.450421
...                                             ...
          299                              0.000000
          300                              1.008403
          301                             14.307964
          302                              0.000000
          303                              2.040816

[304 rows x 1 columns]

Does this work for you?

Note however that ZonalAnalysis ommits the zones that do not intersect with the raster (to avoid pointless computations), and since I have not provided a masks_index_col, the above data frame cannot be directly mapped to your shapefile (note the 304 data frame rows vs. the 317 geometries in the shapefile) - this shouldn't be hard, but the whole point of masks_index_col is to make it easier. Nonetheless, I did not understand which column of your shapefile should be used here. Do you have a column with a one-to-one mapping to the geometries? Otherwise, this might be problematic. Looking over your data, it seems that you might be interested in grouping the shapefile rows with the same FID and merging their polygon geometries as a single multipolygon geometry, so that at the end you have this one-to-one mapping between unique FID and unique geometry (even if it is a multipolygon geometry). If that is actually what you want, you can use geopandas as in:

import geopandas as gpd
import pylandstats as pls

gdf = gpd.read_file(zones_filepath)

gb_col = 'FID_Terr_B'
gb = gdf.groupby(gb_col)
group_gser = gpd.GeoSeries(gb.apply(lambda g: g['geometry'].unary_union),
                           index=gb.first().index,
                           crs=gdf.crs)
za = pls.ZonalAnalysis(raster_filepath, masks=group_gser)
pland_df = za.compute_class_metrics_df(['proportion_of_landscape'], [20])

and then you will have a pland_df whose index attribute_values level correspond to 'FID_Terr_B' (or whichever column you set to gb_col in the snippet above.

Let me know how this goes.

Best, Martí

emuise commented 3 years ago

Hi @martibosch, I believe I've got everything working now. For the CRS issue, your code plots a single zonal landscape, which appears to have been the goal.

For the masks_col_index, I have made a new unique ID field (copied from the one ArcGIS gives), which gets it to behave properly. I think Arc's FID field (Data type is "Object ID" in Arc) isn't readable by other programs which is what was throwing pylandstats off when I tried to have mask_index_col = "FID" The missing rows/geometries are due to sliver polygons, which did not include a full pixel. These are now (mostly) removed in my preprocessing steps.

Everything appears to be working properly. Thank you so much. Evan

martibosch commented 3 years ago

Hello @emuise,

glad that everything seems to work out. Your data made me realize that the performance of pylandstats for a ZonalAnalysis with large number of rows (e.g., more than 50) should be improved. I will work towards that in the next releases.

Thank you again for using pylandstats. Best, Martí