martibosch / pylandstats

Computing landscape metrics in the Python ecosystem
https://doi.org/10.1371/journal.pone.0225734
GNU General Public License v3.0

Landscape complexity analysis [feature] #8

Closed · achennu closed this issue 2 years ago

achennu commented 3 years ago

Thanks for creating and sharing this nice library @martibosch

I was looking to perform analysis of landscape complexity, and think that pylandstats would be a good host for such functionality. By landscape complexity, I'm specifically referring to the recently published information theoretical framework proposed by Nowosad and Stepinski.

A first version on bioRxiv gives a more bare-bones description of the methodology: https://www.biorxiv.org/content/10.1101/383281v1

The peer-reviewed version, which adds more context, is here: https://link.springer.com/article/10.1007/s10980-019-00830-x

Description

Information theory can provide a consistent and universal framework for analyzing the complexity of landscape patterns. In the referenced study, the authors explain that a single metric cannot capture landscape complexity; two metrics are required, one for each of its inherent components. The two components of landscape complexity are compositional (diversity of classes) and configurational (spatial pattern), and the authors show that compositional complexity is always the major component of landscape complexity.

They propose two metrics that can be used to analyze the complexity of a landscape: H and U. H is the well-known (Shannon) marginal entropy of the p log(p) form. U is the mutual information (eq 5) divided by the marginal entropy, i.e., U = I(y,x) / H(y) = 1 - H(y|x) / H(y). Essentially, this relies on an adjacency matrix describing the classes of the cells adjacent to a focus cell of a given class.
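For reference, restating the paper's definitions in consistent notation (x denotes the class of the focus cell and y the class of the adjacent cell):

H(y)   = -Σ_j p(y_j) log p(y_j)                       (marginal entropy)
H(y|x) = -Σ_i p(x_i) Σ_j p(y_j|x_i) log p(y_j|x_i)    (conditional entropy, eq 2)
I(y,x) = H(y) - H(y|x)                                (mutual information, eq 5)
U      = I(y,x) / H(y)                                (relative mutual information)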

Together with a proposed visualization, the HYU diagram, these metrics project landscape patterns into a 2D space where landscapes with different numbers of classes but the same Shannon entropy (H) can be differentiated by their configurational complexity using U. See Fig 2A.

Implementation

A reference implementation from the authors is available in the R package landscapemetrics

Having looked into the Landscape class, I see that an internal dataframe is created with the horizontal and vertical adjacencies for a given landscape raster. I suppose this is the Rook's pattern (4 neighbors) adjacency, which is also what the original work uses for the analysis.

I'd like to help work up some code to produce the H & U metrics for a given landscape. H is pretty simple, as it just requires the proportional counts (p) of each class in the landscape to be used in the p log(p) summation. The mutual information I(y,x) requires the calculation of the conditional entropy H(y|x) (eq 2), i.e., the summation of p log(p) where p refers to the proportion of adjacent cells having a class y given that the focus cell has the class label x (if I've understood that correctly).
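To make the two ingredients concrete, here is a minimal NumPy sketch (the function names and the layout of the adjacency-count matrix are my own assumptions, not pylandstats API):

import numpy as np

def marginal_entropy(arr):
    # H(y): entropy of the proportional class counts of all cells
    _, counts = np.unique(arr, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(adj_counts):
    # adj_counts[i, j]: number of adjacencies between focus class i and
    # adjacent class j; uses the identity H(y|x) = H(x, y) - H(x)
    p = adj_counts / adj_counts.sum()
    px = p.sum(axis=1)
    h_joint = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    h_x = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return h_joint - h_x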

Could you indicate if this would be a welcome addition to the pylandstats library?

Some guidance on how to use the cached adjacency dataframe for calculating H(x|y) would be useful!

martibosch commented 3 years ago

Hello @achennu,

first of all, thank you for using PyLandStats. I would of course welcome such a contribution. I am actually releasing a new minor version (v2.2.0) today, but I would like to release another one in one or two weeks, including a "moving window" metric computation and potentially the feature that you propose.

The adjacency matrix is computed using the von Neumann neighbourhood (i.e., the rook/4-neighbour adjacency), so it should not be too hard to compute the mutual information (given what I read on page 2094/equation 2 of the article in "Landscape Ecology") from the Landscape._adjacency_df property - the data frame is quite self-explanatory. I am not too familiar with R, but what I did in cases like this was to use the rocker/geospatial Docker image to run RStudio in my browser and check what the main lines of the mutual information metric do in landscapemetrics so that I can implement it using NumPy/pandas.
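For intuition, rook adjacency just means that each cell is paired with its four orthogonal neighbours; a toy, pure-NumPy illustration of enumerating such pairs (not how PyLandStats does it internally) would be:

import numpy as np

arr = np.array([[1, 2],
                [1, 1]])
# horizontal adjacencies pair each cell with its right neighbour,
# vertical adjacencies pair each cell with the neighbour below
horiz = list(zip(arr[:, :-1].ravel().tolist(), arr[:, 1:].ravel().tolist()))
vert = list(zip(arr[:-1, :].ravel().tolist(), arr[1:, :].ravel().tolist()))
print(horiz)  # [(1, 2), (1, 1)]
print(vert)   # [(1, 1), (2, 1)]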

I hope that this is helpful; do not hesitate to ask for more guidance if you need it. You are of course more than welcome to submit a pull request, and your contribution will be mentioned in the Acknowledgments section of the README (as well as in the "Contributors" panel of the main repository page). I do however require new features to be tested. Otherwise, I can also implement it for you.

Best regards, Martí

achennu commented 3 years ago

Thanks for the feedback.

I did consider making a pull request, but ran into trouble getting a development install of pylandstats. I had hoped that installing the dependencies in a conda env followed by pip install -e pylandstats/ would work, but this threw some errors and exited. It would be useful if the README.md contained a set of (conda and pip) commands to run to create a development install.

> I am not too familiar with R, but what I did in cases like this was to use the rocker/geospatial Docker image to run RStudio in my browser and check what the main lines of the mutual information metric do in landscapemetrics so that I can implement it using NumPy/pandas.

Those are the correct lines, yes, and they call C++ functions in the background. This is likely for speed of computation.

> I do however require new features to be tested. Otherwise, I can also implement it for you.

That's a kind offer, and I may have to come back to that. But first, I'll try to work up a few functions and a basic API, and we can review it then.

+1 for testing. What would a test case look like here? I guess we want to have artificial landscapes where we know what H and U should be?

martibosch commented 3 years ago

Hello again @achennu, thank you for your dedication to this.

A development install of PyLandStats might indeed be troublesome (because it uses the transonic library to speed up the computation of the adjacency matrix), so it is a good idea to add a section covering it. In the meantime, you might try the following:

* If you have Linux/Mac, you can try:

  1. creating a new conda environment (the name does not have to be 'pylandstats'): `conda create -n pylandstats -c conda-forge python pip setuptools wheel pythran gast transonic numpy pandas geopandas matplotlib-base rasterio scipy openblas`
  2. activating the new environment: `conda activate pylandstats`
  3. installing PyLandStats in development mode: `pip install -e path/to/pylandstats/ --no-build-isolation`

Regarding the tests: for new metrics, I basically test that the computed values fall within the appropriate ranges, so adding tests for the new metrics in the test_landscape_metrics_value_ranges function should be enough. For the entropy and the mutual information, such a range is just greater than or equal to zero.
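For illustration, such a range test might look roughly like this (a minimal sketch; entropy and mutual_information are placeholder method names for the metrics to be added, not existing PyLandStats API):

import numpy as np
import pylandstats as pls

def test_landscape_metrics_value_ranges():
    ls = pls.Landscape(np.random.randint(1, 4, size=(16, 16)), res=(1, 1))
    # entropy-like metrics are bounded below by zero
    assert ls.entropy() >= 0
    assert ls.mutual_information() >= 0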

Let me know if you still want to develop this feature, and if so, let me know if you managed to install PyLandStats in development mode. Otherwise, I can implement the feature myself. Best, Martí

achennu commented 3 years ago
> * If you have Linux/Mac, you can try:
>
>   1. creating a new conda environment (the name does not have to be 'pylandstats'): `conda create -n pylandstats -c conda-forge python pip setuptools wheel pythran gast transonic numpy pandas geopandas matplotlib-base rasterio scipy openblas`
>   2. activating the new environment: `conda activate pylandstats`
>   3. installing PyLandStats in development mode: `pip install -e path/to/pylandstats/ --no-build-isolation`

This worked just fine. I was able to import pylandstats in the installed Python 3.8 env. These commands could go into the repo's README for the next release.

But that's a pretty heavy install. Is that mainly because of the transonic dependency? What functionality of transonic is useful for the adjacency matrix calculations - Cython?

And just curious: why is --no-build-isolation necessary?

> Regarding the tests: for new metrics, I basically test that the computed values fall within the appropriate ranges, so adding tests for the new metrics in the test_landscape_metrics_value_ranges function should be enough. For the entropy and the mutual information, such a range is just greater than or equal to zero.

Sounds reasonable. I'll try to put that together.

> Let me know if you still want to develop this feature, and if so, let me know if you managed to install PyLandStats in development mode. Otherwise, I can implement the feature myself.

I'll give it a go and get back to you in a few days. Thanks for the help!

martibosch commented 3 years ago

I know this is quite a heavy install. Indeed, this is because of the transonic/Pythran dependency, which is used to very substantially speed up the adjacency matrix calculations. I had also tried Cython, Numba and pybind11, and Pythran worked the best. The --no-build-isolation flag is necessary so that pip does not install the build dependencies, and Pythran can instead take them from the conda environment. I am pinging @paugier here (sorry again) just to make sure that what I said is correct and that there is no simpler way for a dev install (although I could probably write an environment-dev.yml to make things a bit easier).

Thank you again for taking time to develop this new feature.

Best, Martí

paugier commented 3 years ago

I think the dev install process can be simplified (I don't think the --no-build-isolation is needed), and there is no need to install pythran, gast and transonic with conda (nor setuptools and wheel). The difficulty is not about using Transonic but is related to Pythran. Since pylandstats uses Pythran for its build, the build requires a recent C++ compiler and (maybe, depending on the code) a dev install of openblas. However, nowadays it's very simple to install these anywhere (with your OS package manager or with conda, packages clangdev and "blas-devel[build=*openblas]"). One may need to use a .pythranrc file containing something like:

[compiler]
CXX=clang++
CC=clang
blas=openblas

On my PC (Linux), I was able to install with

conda create -n env_pylandstats pandas geopandas matplotlib-base rasterio scipy openblas
conda activate env_pylandstats
cd ~/Dev/pylandstats
pip install -e .

2 remarks for @martibosch

martibosch commented 3 years ago

Hello @paugier,

thank you again for your thorough answer. I will explore the development install and add a section about it in the README.

Regarding the conda recipe's runtime dependencies: should I remove Pythran only, or go further and remove Pythran, Numba and gast? I suppose that these are transonic dependencies and that transonic will automatically detect whether we are on a Linux/Mac/Windows machine and install them accordingly, right?

Thank you again. Best, Martí

paugier commented 3 years ago

You can remove pythran and gast from the runtime dependencies. In contrast, Numba is required at runtime (only for Windows of course). Transonic does not try to install anything (Numba is used only if it is installed).

Of course the other solution is to use Pythran also on Windows (I guess with clang). In that case, no need for the conditions depending on the OS in the code and no need for Numba.

achennu commented 3 years ago

Coming back to the topic of this issue...

[screenshot: printed `_adjacency_df` of a 4-class landscape]

This is a print of the `_adjacency_df` calculated from a landscape of 4 classes. I just want to make sure I'm understanding this correctly. If we look at column 3, the dataframe tells me that:

* For all cells with the class `3`, the adjacent cells in the horizontal direction were also `3` 70 times, class `4` 60 times and class `1` 63 times.
* For all cells with the class `2`, the adjacent cells in the vertical direction had the class `3` 61 times and class `4` 69 times.

Is this correct?

I don't understand the use of 0 (self.nodata). I supplied a landscape with no cells marked nodata, so where is this information coming from? From the padding used in the adjacency matrix calculation? How is that differentiated from cells marked 0 in the supplied landscape array?

martibosch commented 3 years ago

> You can remove pythran and gast from the runtime dependencies. In contrast, Numba is required at runtime (only for Windows of course). Transonic does not try to install anything (Numba is used only if it is installed).
>
> Of course the other solution is to use Pythran also on Windows (I guess with clang). In that case, no need for the conditions depending on the OS in the code and no need for Numba.

Thank you @paugier! I will remove the pythran and gast runtime dependencies, and I will then use the Azure pipelines of the pylandstats-feedstock to test conda recipes that use Pythran on Windows (I should nonetheless run some benchmarks on whether Pythran is faster than Numba on Windows, and accordingly decide whether I keep the Numba runtime dependency for Windows).

martibosch commented 3 years ago

> Coming back to the topic of this issue...
>
> [screenshot: printed `_adjacency_df` of a 4-class landscape]
>
> This is a print of the `_adjacency_df` calculated from a landscape of 4 classes. I just want to make sure I'm understanding this correctly. If we look at column 3, the dataframe tells me that:
>
> * For all cells with the class `3`, the adjacent cells in the horizontal direction were also `3` 70 times, class `4` 60 times and class `1` 63 times.
> * For all cells with the class `2`, the adjacent cells in the vertical direction had the class `3` 61 times and class `4` 69 times.
>
> Is this correct?
>
> I don't understand the use of 0 (self.nodata). I supplied a landscape with no cells marked nodata, so where is this information coming from? From the padding used in the adjacency matrix calculation? How is that differentiated from cells marked 0 in the supplied landscape array?

Hello @achennu! Yes, you are understanding the adjacency matrix correctly. I suppose you noticed that the horizontal and vertical components are symmetric matrices (since pixel adjacencies are symmetric).

Regarding the nodata value: I assume that you instantiated the Landscape with a NumPy array as the landscape argument. In that case, you need to explicitly tell pylandstats which value corresponds to nodata; otherwise pylandstats assumes it to be zero (this was hardcoded in Landscape.__init__, but I have just pushed a commit to the development branch so that the user can use the settings module to set the default nodata value).

As you guessed, if your landscape does not have nodata pixels, the row (and column, since we have a symmetric matrix) for the nodata class corresponds to the border of your landscape (hence the padding used in the adjacency matrix computation). So although the nodata value of 0 used to index the row (and column) of your adjacency matrix has no relationship to the pixel values of your array, it does feature the number of adjacencies between pixels of each (valid) class and the border of your landscape, which might be useful information in some cases.
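As a side note, if a landscape legitimately uses 0 as a class value, a simple workaround (plain NumPy, independent of any PyLandStats API) is to shift the class labels so that 0 is left free to act as nodata:

import numpy as np

landscape_arr = np.random.randint(0, 4, size=(8, 8))  # classes 0..3
landscape_arr += 1  # classes become 1..4, leaving 0 free for nodata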

Thank you again for your contributions. Best, Martí

achennu commented 3 years ago

Hi Martí, sorry for the lapse. I got inundated with other work for a bit. I've tried to make progress on this, and I'll try to explain where I landed with my efforts.

I start with a synthetic landscape which has two classes equally distributed.

import numpy as np
SIZE = (32, 32)
landscape_data = np.ones(SIZE, dtype=int)
# quadrants
landscape_data[:16, :16] = 2
landscape_data[16:, 16:] = 2

import pylandstats as pls
landscape = pls.Landscape(landscape_data, res=(1, 1))
adjmat = landscape._adjacency_df

The marginal entropy (or Shannon index) is computed from the proportion of cells of each class:

SHDI = landscape.shannon_diversity_index()

Pclasses = np.bincount(landscape_data.ravel()) / landscape_data.size
marginal = 0
for Pclass in Pclasses:
    if Pclass > 0:
        marginal += Pclass * np.log2(Pclass)
# with a log2 definition this should be 1.0 as the Shannon index

marginal *= -1
print(f'marginal: {marginal} SHDI: {SHDI}')

marginal: 1.0 SHDI: 0.6931471805599453

Next, we consider subsets of cell pairs such that the class of the focus cell is fixed. In such a subset, the class of the adjacent cell is a univariate random variable y|x = c_i taking values y ∈ {c_1, ..., c_K}.

To obtain the full account of the distribution of adjacencies, we use the IT concept of conditional entropy (eq 2).

What is the distribution of adjacencies? We don't care whether an adjacency is vertical or horizontal, so we sum the adjacencies along this dimension:

adjacency = adjmat.sum(level=[1])

Along the columns are the focus-cell classes; along the rows are the adjacent-cell classes. We do not want to consider the nodata class 0, so we slice it out:

adjacency = adjacency.loc[adjacency.columns[:-1], adjacency.columns[:-1]]
print(adjacency)

        1     2
1  1920    64
2    64  1920

What is the 'probability' of the adjacencies?

adjacency_P = adjacency / landscape_data.size

print(adjacency_P)

            1       2
1  1.8750  0.0625
2  0.0625  1.8750

Now calculate p log(p), also with base 2:

PlogP = (adjacency_P * np.log2(adjacency_P)).fillna(0)
print(PlogP)

             1        2
1  1.70042 -0.25000
2 -0.25000  1.70042

Calculate the conditional and joint entropies

cond_entropy = -1 * PlogP.sum(axis=0).sum()
joint_entropy = marginal + cond_entropy
print(f'conditional: {cond_entropy}')
print(f'joint: {joint_entropy}')

conditional: -2.9008397335319445
joint: -1.9008397335319445

So apart from using log2 base as in the original definition (and comparing to natural-log SHDI from the Landscape class), this is how I've reasoned through the adjacency matrix.

The mutual information is given by eq 5, which states: I(y,x) = H(y) - H(y|x). So I(y,x) = marginal - conditional = 1.0 - (-2.9008) = 3.9008.

Jensen's inequality must be satisfied, that is, I(y,x) >= 0. And this is true.

So, this should overall be ok -- assuming how I've worked with the adjacency matrix is correct.

What is missing in my implementation is a clear/correct usage (or understanding) of the 'landscape metadata', such as the size and the patch properties (self._get_patch_area_ser(class_val), etc.).

I've worked in the numpy world, and don't know how to convert it to the Landscape world. Might be worth checking against the R code you posted too.

Let me know what you think.

martibosch commented 3 years ago

Hello @achennu, I am sorry that I have been absent for so long. I try to be very responsive, especially so as not to discourage nice contributions such as yours, but I am finishing my PhD thesis and these last months have been a bit crazy. I will go through your comment above tomorrow and see how we can merge this into pylandstats.

Thank you. Best, Martí

martibosch commented 3 years ago

Dear @achennu,

thank you again for your contribution. I have gone through your example code, the article by Nowosad and Stepinski, as well as the code of landscapemetrics, and I have come up with the following code:

import numpy as np
import pylandstats as pls
SIZE = (32, 32)
landscape_data = np.ones(SIZE, dtype=int)
# quadrants
landscape_data[:16, :16] = 2
landscape_data[16:, 16:] = 2

landscape = pls.Landscape(landscape_data, res=(1, 1))

I have DRYed the computation of a data frame with the total adjacencies (vertical and horizontal), and the computation of the entropy of a given one-dimensional array (with any base):

def compute_total_adjacency_df(landscape):
    # first `sum` to add up the vertical and horizontal adjacencies
    # (first-level index), then `loc` to leave out the nodata row/column
    return landscape._adjacency_df.sum(
        level=[1]).loc[landscape.classes, landscape.classes]

def compute_entropy(xs, base=None):
    _xs = xs / xs.sum()
    ent = - np.sum(_xs * np.log(_xs))
    if base:
        ent = ent / np.log(base)
    return ent

Then I compute these variables that concern the adjacencies:

# get data frame of total adjacencies (i.e., ignoring whether they are vertical
# or horizontal)
adjacency_df = compute_total_adjacency_df(landscape)

# `sum` to get each column's sum
adjacency_counts = adjacency_df.sum()

# this would be the "adjacency vector", I just find it more Pythonic to add an
# "s" suffix for one-dimensional ndarrays (e.g., vectors, lists...)
adjacencies = adjacency_df.values.flatten()

And the metrics can be computed as follows:

# compute the marginal entropy
marg_ent = compute_entropy(adjacency_counts, base=2)
print(marg_ent)  # 1.0

# compute the joint entropy
joint_ent = compute_entropy(adjacencies, base=2)
print(joint_ent)  # 1.205592508185083

# compute the conditional entropy
cond_ent = joint_ent - marg_ent
print(cond_ent)  # 0.2055925081850829

# compute the mutual information
mutual_info = marg_ent - cond_ent
print(mutual_info)  # 0.7944074918149171

# compute the relative mutual information
rel_mutual_info = mutual_info / marg_ent
print(rel_mutual_info)  # 0.7944074918149171
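Putting these together, the coordinates of this landscape in the HYU diagram mentioned at the top of this issue are simply the pair (marg_ent, rel_mutual_info):

H, U = marg_ent, rel_mutual_info
print(H, U)  # 1.0 0.7944074918149171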

There are two ways to move forward from here:

a) you submit a pull request. In such a case, I will provide you with further guidelines on how and where to put each function and we will move this discussion to the pull request thread. You will automatically appear in the "Contributors" block, and I will also add an acknowledgement at the end of the README.

b) I take care of adding these functionalities to pylandstats, and I add an acknowledgement of your contribution at the end of the README

Let me know what you think. Best, Martí

achennu commented 3 years ago

Hey @martibosch, cool that we're back on the issue.

Two minor points to clarify:

  1. Doesn't this produce just a homogeneous landscape?
# quadrants
landscape_data[:16, :16] = 2
landscape_data[16:, 16:] = 2
  2. Shouldn't mutual_info be equal to 1.0, because it would be 1.205592508185083 - 0.2055925081850829?

Maybe just a copy-paste error on both?

Overall it seems good, thanks for checking it out!

I'm happy to make a pull request and then discuss the API layout there. I think it would be nice to aim for a compute_HYU sort of top-level function. I'll set up the PR sometime this week.
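Something along these lines, perhaps (purely illustrative; compute_HYU and the metric method names are hypothetical at this point):

def compute_HYU(landscape):
    # return the (H, U) coordinates of a landscape in the HYU diagram;
    # `entropy` and `relative_mutual_information` are placeholder names
    return (landscape.entropy(base=2),
            landscape.relative_mutual_information(base=2))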

martibosch commented 3 years ago

Hello @achennu,

regarding your two points:

  1. I copied your code snippet to create the landscape_data array; it creates a landscape with two quadrants filled with ones and two filled with twos:

     I start with a synthetic landscape which has two classes equally distributed.

     import numpy as np
     SIZE = (32, 32)
     landscape_data = np.ones(SIZE, dtype=int)
     # quadrants
     landscape_data[:16, :16] = 2
     landscape_data[16:, 16:] = 2
  2. The mutual_info is the difference between the marginal entropy and the conditional entropy (equation 5 of the paper by Nowosad and Stepinski). I actually checked and I obtain the same values with landscapemetrics for all the above metrics.

Regarding the pull request, two quick comments before we move the discussion there:

  1. I believe the best way to add these functions is as methods to compute the metrics in the Landscape class, more precisely, after the landscape-level metrics comment in landscape.py.
  2. We could take this opportunity to DRY the computation of the entropy of any given vector, as done with the compute_entropy function in my code snippet above. I believe a function like this could live before the definition of the Landscape class in landscape.py, like the compute_adjacency_arr function. I'd suggest first adding this function and using it in the shannon_diversity_index method in one commit, and then adding the methods for the new metrics in a separate commit (see the sketch below for how the helper can be reused).
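For instance, reusing compute_entropy from the snippet above reproduces the Shannon diversity index of the synthetic landscape (natural log, i.e., no base argument; the [1:] indexing assumes the classes are 1 and 2):

class_counts = np.bincount(landscape_data.ravel())[1:]  # drop the empty 0 bin
shdi = compute_entropy(class_counts)
print(shdi)  # 0.6931471805599453, matching landscape.shannon_diversity_index()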

Thank you again for your contributions. Best, Martí

martibosch commented 2 years ago

Hello @achennu,

thank you again for your work on this one. This is finally released in v2.4.0. You can see an example HYU diagram in the new cluster analysis notebook.

You now appear among the "Contributors" of the repository as well as in the "Acknowledgments" section of the README. Thank you again for your suggestion and your work. Best, Martí

achennu commented 2 years ago

Thanks!

That's a great introduction to the cluster analysis feature in the notebook.