martibosch / pylandstats

Computing landscape metrics in the Python ecosystem
https://doi.org/10.1371/journal.pone.0225734
GNU General Public License v3.0
82 stars, 16 forks

Landscape complexity analysis #12

Closed achennu closed 2 years ago

achennu commented 3 years ago

This PR is regarding issue #8

The plan would be as outlined, to add a top-level function to compute entropy, and then add the methods in Landscape to implement complexity calculations.

The first commit adds the compute_entropy function at the top level. There are a couple of TODO items to be sorted out in the shannon_diversity_index method. I also could not identify which tests should be updated for the method. If we now set a default base for the logarithm, then a test should be created to capture that API choice.
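For reference, a minimal sketch of what such a top-level compute_entropy function could look like (the signature, the base=None default, and the zero-count handling here are assumptions for illustration, not the code in this PR):

```python
import numpy as np


def compute_entropy(counts, base=None):
    """Shannon entropy of a 1D array of category counts.

    Hypothetical sketch: with base=None the natural logarithm is used,
    otherwise the result is converted via a change of base.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]  # zero counts contribute nothing
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log(probs))
    if base is not None:
        entropy /= np.log(base)  # change of base from the natural logarithm
    return entropy
```

A test capturing the default-base choice would then simply assert the natural-log value for a known distribution.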

Next, I will add the complexity metrics methods and corresponding tests.

We want methods that can return the relevant entropy quantities: the marginal entropy H, the joint entropy, and the relative mutual information U.

From a usage perspective, we typically want H and U for a landscape, and it's important that all the entropies are calculated with the same logarithmic base. So I suggest a method like complexity_metrics which returns both H and U, which can then be used to order landscapes along the two aspects of compositional and configurational complexity (like in Figure 2 of the publication).
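As a rough illustration of the two quantities (a standalone sketch that assumes a square class-adjacency count matrix as input; this is not pylandstats code, and the input format is an assumption):

```python
import numpy as np


def complexity_metrics(adjacency, base=2):
    """Return (H, U) from a symmetric class-adjacency count matrix.

    H is the marginal entropy (compositional complexity); U is the
    relative mutual information (configurational complexity).
    Hypothetical sketch, not the PR implementation.
    """
    p_joint = adjacency / adjacency.sum()

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    H = entropy(p_joint.sum(axis=1))    # marginal entropy of class counts
    H_joint = entropy(p_joint.ravel())  # joint entropy of adjacent pairs
    I = 2 * H - H_joint                 # mutual information (symmetric case)
    U = I / H if H > 0 else np.nan      # relative mutual information
    return H, U
```

Computing both values in one call guarantees that they share the same logarithmic base, which is the motivation for a combined complexity_metrics method.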

So basic API questions:

  1. Expose the entropy methods as public on Landscape, or rather _marginal_entropy etc? I think public methods.
  2. Preference to cluster method names? So entropy_marginal, entropy_joint, entropy_mutual_info, etc?

I don't think I'll be able to run tests on py2.7, but I hope you have a test rig for that @martibosch

achennu commented 3 years ago

I've now added another commit with the complexity metrics. Based on looking at the other methods, I settled on three methods: entropy_marginal, entropy_joint and complexity_metrics.

I did not see the value of exposing each of the intermediate values of the complexity calculation as a separate method.

Currently, there is a repetition of compute_total_adjacency_df in the two entropy methods. This repetition is avoided when using complexity_metrics. I'll leave the caching decision of this calculation up to you, as I don't know what the speed/perf characteristics are. I imagine not much given that it works with dataframes.

Does that look overall ok @martibosch ? If so, the tests can be developed next.

achennu commented 3 years ago

Hi @martibosch , have you had a chance to review this?

martibosch commented 3 years ago

Hello @achennu, my apologies for my delay on this. I have my thesis defense in two days (the 30th), and I will therefore get to it the day after that. Thank you again for your contribution, which I am looking forward to including in pylandstats.

Best, Martí

martibosch commented 3 years ago

Hello @achennu! sorry for the delay in my response. This looks overall great, I just have a few minor comments and suggestions:

  1. I think it would be best to add separate methods instead of the complexity_metrics method, i.e., two methods for the marginal and joint entropy respectively (both are already in your commit), an additional method for the conditional entropy, and two other methods for the mutual information and the relative mutual information respectively. We could then add a notebook in the pylandstats-notebooks repository which covers the new landscape complexity feature and plots different landscapes in the H, U axes. Feel free to submit a pull request there; otherwise I can do it myself once this PR is merged.

     1-bis: I would change the names of the entropy metrics to marginal_entropy, joint_entropy and conditional_entropy (which are more in line with the names in the landscapemetrics package and with how the other methods in pylandstats are named). The mutual information method names would be mutual_information and relative_mutual_information respectively.

  2. The metrics should be added to the LANDSCAPE_METRICS constant in the Landscape class.

  3. Like contagion or shannon_diversity_index, the new metrics should all warn when the landscape has only one class and return np.nan (since entropy cannot be computed in that case). If we add 5 new metrics that require such a warning, I believe that we could define a decorator that takes care of that instead of copy-pasting.
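Such a decorator could look roughly like this (a sketch: the classes attribute and the warning wording are assumptions about the Landscape class, not existing pylandstats code):

```python
import functools
import warnings

import numpy as np


def single_class_nan(method):
    """Warn and return np.nan when the landscape has a single class.

    Hypothetical decorator sketch for the five new entropy-based metrics,
    replacing the guard clause that would otherwise be copy-pasted.
    """
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if len(self.classes) < 2:  # `classes` attribute is assumed here
            warnings.warn(
                f"{method.__name__} is not defined for a landscape with a "
                "single class", RuntimeWarning)
            return np.nan
        return method(self, *args, **kwargs)
    return wrapper
```

Applying the decorator to each metric method keeps the warn-and-nan behavior in one place.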

  4. The new metrics should be tested. First, we should test that the warnings are raised. Secondly, we should test the value ranges. Finally, we should test the base argument, e.g., by adding it to the test_multilandscape_metric_kws and ensuring that all entropies computed with base (e.g.) 10 are lower than those computed with base 2.
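The base check in point 4 amounts to an assertion like the following (a standalone sketch with a local entropy helper, not the pylandstats test suite):

```python
import numpy as np


def entropy(counts, base=2):
    # Shannon entropy of category counts, in the given logarithm base
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log(p)) / np.log(base)


# for any non-degenerate distribution, a larger base yields a smaller value,
# since the two differ only by the constant factor log(2) / log(10)
counts = [3, 1, 4, 1, 5]
assert entropy(counts, base=10) < entropy(counts, base=2)
```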

  5. It is fine that the new metrics have a base=2 as default argument, however I believe that the shannon_diversity_index should have a None as default (which corresponds to the natural logarithm). This is the default definition of Shannon's entropy, and is also the way it's done by default in FRAGSTATS (I belive that full FRAGSTATS compatibility by default is a feature desired by many users) and also in landscapemetrics. Actually, landscapemetrics uses the natural logarithm without letting the user choose the logarithm base. Fixing this would probably also fix the failed tests above. 5-bis: I would like the docstrings of the new methods to be consistent with those of the existing metrics, so two notes:

    1. I leave a white space between the variable name and the colon
    2. if there are default argument values, I put them after the type (with a comma for separation). So in short, e.g., this:
      
      Parameters
      ----------
      base: int
          The base of the logarithm (default 2)

      Returns
      -------
      H: float
          The marginal entropy as a measure of compositional complexity.
          With a base of 2, this is identical to the definition of
          Shannon's diversity

    should become something like:

      Parameters
      ----------
      base : int, default 2
          The base of the logarithm

      Returns
      -------
      H : float
          The marginal entropy as a measure of compositional complexity.
          With a base of 2, this is identical to the definition of
          Shannon's diversity

    Also in the `compute_entropy` function, I would change:
    

      Parameters
      ----------
      counts: np.ndarray 1D integer array
          The number of occurrences of each category
      base: int
          The base for logarithm calculation, with default as the natural
          logarithm (Euler's number).

    
    to:
    

      Parameters
      ----------
      counts : np.ndarray
          The number of occurrences of each category as 1D integer array
      base : int, default None
          The base for logarithm calculation. With the default value (None),
          the entropy is computed with the natural logarithm (with Euler's
          number as base).

  6. Regarding the repetition of compute_total_adjacency_df in the two entropy methods: the _adjacency_df property already uses a cache mechanism so that the full adjacency df (distinguishing vertical and horizontal adjacencies) is only computed once. The aggregation of vertical and horizontal adjacencies is merely a (vectorized) NumPy sum operation which is very fast, and I hence believe that there is no problem in repeating its call, and that implementing a cache mechanism for that is not worth it (it might even be counter-productive for most input rasters). Nonetheless, I believe that compute_total_adjacency_df can be defined as a method of Landscape (unlike compute_entropy, which is completely independent of any Landscape instance), so basically we could copy-paste it within the Landscape class, e.g., after the definition of _adjacency_df, and replace the landscape argument with self. I know that you probably took this part from a code snippet that I sent in #8, however, after reviewing this PR I believe that the class method is the way to go.
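The aggregation in question is essentially a grouped sum over the direction level of the adjacency table. A toy illustration with pandas (the MultiIndex layout used here is an assumption for demonstration; the real _adjacency_df structure may differ):

```python
import pandas as pd

# Toy stand-in for the cached `_adjacency_df`: rows indexed by
# (direction, class), columns by class — a hypothetical layout.
adjacency_df = pd.DataFrame(
    [[2, 1], [1, 2], [3, 0], [0, 3]],
    index=pd.MultiIndex.from_product(
        [["vertical", "horizontal"], [1, 2]], names=["direction", "class"]),
    columns=[1, 2],
)


def compute_total_adjacency_df(adjacency_df):
    # collapse vertical and horizontal adjacencies into one total table;
    # a plain vectorized sum, cheap enough to repeat without caching
    return adjacency_df.groupby(level="class").sum()


total = compute_total_adjacency_df(adjacency_df)
```

Defined as a method, the function body would stay the same with self._adjacency_df replacing the landscape argument.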

  7. I use flake8 for style enforcement. You can use it by going to the root of this repository, and running:

    pip install -r requirements-dev.txt

    then

    flake8 .

    and ensure that no error messages appear.

Voilà. I know that I brought up quite a few things (and that some are quite personal and hence subjective), but I understand if you do not want to do them all. Just do what you feel comfortable doing and I will take care of the rest to complete the PR - and of course, your contributions will be noted both in the Contributors section of the GitHub repository and in the Acknowledgements section of the README.

Thank you again. Best, Martí

martibosch commented 3 years ago

Hello @achennu!

do you plan to move this forward or would you prefer if I take the lead? I am fine with both options and your contribution will be properly acknowledged anyway.

Best, Martí

achennu commented 3 years ago

Hi @martibosch

I hope your PhD defense went well!

I'm afraid that it's not good timing for me. I won't find sufficient time to work on this in the coming days.

I think we've got the functionality figured out, and you also have a coding style that you'd like to follow. I'd be totally OK with you taking over and bringing the PR to completion. Thanks for that!