jwcarr / mantel

Python implementation of the Mantel test, a significance test of the correlation between two distance matrices
MIT License

Add plotting code #7

Closed: Mancheron closed this 2 years ago

Mancheron commented 2 years ago

Proposed code refactoring which, among other things, makes it possible to produce histograms. This refactoring also makes it possible to compute the correlation threshold for accepting either the null hypothesis or the alternative hypothesis.

Mancheron commented 2 years ago

Hi,

First of all, many thanks for your library and for taking my proposal into consideration.

I forked your package because I wanted to compute the minimal and maximal values for which the null hypothesis is accepted. In my first adaptation, I simply modified the values returned by the test() function, returning the standard deviation instead of the z-score. I then intended to produce violin plots, which is why I moved most of your code into the compute_correlations() function. Since I wanted to visually check normality, I wrote the plot() function. In the end, I thought I could make a pull request, but I realized that it would break compatibility for existing programs using your library (at least yours), which is why I proposed a separate mantel_test() function that returns what I need and restores the behavior of your initial test() function. I understand (and mostly agree with) your intention of keeping a simple and intuitive API.

For all these reasons, I made some modifications to my previous request. First of all, I moved the plot() function to a separate file and renamed it plot_correlations(), as you suggested. I also removed the mantel_test() and mantel_test_from_correlations() functions, but since my analysis needs more flexibility, I opted for an object-oriented API. This way, the API exposes:

The Mantel class constructor, which does what your test() function originally did, but allows each piece of information to be accessed separately.

Concerning the histogram, I'm sorry but I don't understand your comment/request. I want to superimpose the density histogram of the correlation coefficients on the normal probability density function (with the confidence interval highlighted). Both the histogram and the normal p.d.f. are drawn using matplotlib. It's a kind of visual check of normality (I know it is not enough to ensure normality, but it is an indicator).

I hope this version will suit you better.

Sincerely, Alban. PS: I apologize for my poor level of English.

jwcarr commented 2 years ago

Thanks! I agree that the codebase could benefit from some modernization and a more flexible API. I will need to take some time to think about all this because there are a lot of proposed changes here, and naturally I don't want to break backwards compatibility. And, if I move to a new API, I would like to keep it simple and get it right first time, so that this package won't require a lot of maintenance going forward.

I'll get back to you soon with more thoughts once I look at the code properly, but one issue that immediately comes to mind is whether it's really appropriate here to take an object-oriented approach. For example, my intuition tells me that one would expect to create a Mantel object and then call some kind of .run() method, but this feels like overkill for such a simple statistical test. The main problem with my code at the moment is that the result returned by test() is very unhelpful. I should be returning some kind of result object that can be inspected in various ways – perhaps a named tuple or dataclass or something more custom. I don't know – need to think some more.
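To make the options concrete, here is a minimal sketch of the named-tuple idea floated above (purely illustrative, not a committed design): it keeps the old positional (r, p, z) behavior while adding attribute access.

import collections

# a named tuple unpacks and indexes exactly like the old (r, p, z) tuple,
# but also supports result.r, result.p, result.z
MantelResult = collections.namedtuple('MantelResult', ['r', 'p', 'z'])

result = MantelResult(r=0.12, p=0.34, z=0.95)
r, p, z = result            # old-style unpacking still works
print(result.p, result[1])  # attribute and index access agree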

jwcarr commented 2 years ago

Hello again! I've spent a bit more time reviewing your code and thinking about these issues. First of all, thanks again for all the work you've put into this – it's great! However, I think it would be useful to break all this down into smaller chunks. There's a lot of new stuff in this PR, which makes it difficult for me to evaluate all the changes at once.

This package has been on GitHub for about a decade and on PyPI for a few years, so it's likely being used by people in the wild. I also consider the package to be in a stable/production state, so I hope you understand why I want to be very careful about making any major changes – I don't want to break people's code or dramatically change the API.

That being said, I think there are several good ideas here. I've tried to break everything down into smaller chunks below with some thoughts about each. Let me know if I've missed something.

Addition of type hinting

Personally I'm not a huge fan of type hinting in Python, but I'm happy to add it for the sake of modernization. However, I'm not sure when type hinting was added to the language. Right now the package is listed as compatible with Python 3.5+. Do we need to change this requirement if we add type hinting? Does this also mean we need to check numpy compatibility? Let's come back to type hinting later because this is kind of a separate issue.
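For what it's worth, PEP 484 type hints and the typing module landed in Python 3.5, so annotations of the form sketched below (a hypothetical signature, just for illustration) would not by themselves force a version bump; only newer syntax, such as built-in generics like list[int] (Python 3.9) or the X | Y union syntax (Python 3.10), would.

from typing import Tuple

import numpy as np

# hypothetical annotated signature, illustrating 3.5-compatible hints
def test(X: np.ndarray, Y: np.ndarray, perms: int = 10000) -> Tuple[float, float, float]:
    ...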

Plotting function

Yep, I think it would be nice to add a convenience function for plotting the results of a Mantel test, and your code looks good. Let's do this. Regarding my previous comment, I hadn't realized there were two separate densities being drawn. However, my point about the y-axis still stands – I think this should be frequency, or just hidden completely, because density units are not very informative. I see you also considered setting the xlim to [-1, 1] – what are the pros and cons of this?
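For concreteness, here is a minimal matplotlib sketch of the kind of figure under discussion (illustrative only, not the PR's actual code): the density histogram of the permutation correlations superimposed on the fitted normal p.d.f., with the confidence interval highlighted and the density units hidden.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

correlations = np.random.normal(0.0, 0.1, 10000)  # placeholder for the stored permutation correlations

mean, std = correlations.mean(), correlations.std()
x = np.linspace(-1, 1, 500)

fig, ax = plt.subplots()
ax.hist(correlations, bins=50, density=True, alpha=0.5)  # empirical density
ax.plot(x, norm.pdf(x, mean, std))                       # fitted normal p.d.f.
ax.axvspan(*norm.interval(0.95, mean, std), alpha=0.2)   # highlight the 95% interval
ax.set_xlim(-1, 1)  # the xlim choice discussed above
ax.set_yticks([])   # hide density units, per the y-axis comment
plt.show()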

Object oriented approach

If I understood correctly, your proposal is to offer two ways of performing the test: (1) the test() function as before and (2) the Mantel object. In general, I don't think it's a good idea for a package to offer two different ways to do the same thing. However, I do like the idea of returning a MantelResult object from the test() function, which would hold the results of the test. This would be roughly similar to your Mantel object, except the computation is performed inside the test() function as before. I've created a new branch and sketched out what I think this class should look like. Let me know what you think: https://github.com/jwcarr/mantel/blob/result_object/mantel/_test.py

Under this new version of the code, the expected usage pattern is like this:

import numpy as np
import mantel

# build two random condensed distance vectors for 10 objects
# (length = number of pairwise distances)
n_objects = 10
X = np.random.random((n_objects**2 - n_objects) // 2)
Y = np.random.random((n_objects**2 - n_objects) // 2)

result = mantel.test(X, Y)

print(result)

# the three classic statistics are available as attributes
print(result.r)
print(result.p)
print(result.z)

# ...as are the underlying permutation correlations and their summary statistics
print(result.correlations)
print(result.mean)
print(result.std)
print(result.tail)

# indexing works as if the result were the old (r, p, z) tuple
print(result[0])
print(result[1])
print(result[2])

# tuple-style unpacking also still works
r, p, z = result
print(r, p, z)

# the tail can be changed after the fact, updating the reported result
result.tail = 'upper'
print(result)

result.tail = 'lower'
print(result)
Importantly, this maintains full backwards compatibility because the MantelResult object can still be indexed and iterated over as if it were a tuple of the form (r, p, z). I like this solution because it offers all the benefits that I think you were looking for, while maintaining the consistency of the API.
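For illustration, here is a stripped-down sketch of how that tuple-like behavior can be achieved (the real class lives in the branch linked above; this omits all of the actual computation):

class MantelResult:

    # minimal sketch: holds r, p, and z but behaves like the old tuple
    def __init__(self, r, p, z):
        self.r, self.p, self.z = r, p, z

    def __getitem__(self, index):
        # indexing as if the result were the old (r, p, z) tuple
        return (self.r, self.p, self.z)[index]

    def __iter__(self):
        # allows r, p, z = result unpacking
        return iter((self.r, self.p, self.z))

    def __repr__(self):
        return 'MantelResult(r={}, p={}, z={})'.format(self.r, self.p, self.z)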

Specification of alpha level and evaluation of statistical significance

This is a nice idea, but I would avoid adding it. It's very easy to check whether result.p < 0.05, so I don't think we really need methods to perform this comparison. Instead, I would give the plot_correlations() function an alpha=0.05 argument, so that the user can set the significance level when plotting, if desired.
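In other words, something like this (the plot_correlations() call signature here is the proposal under discussion, so treat it as assumed rather than final):

# significance is a one-line check, so no dedicated method is needed
if result.p < 0.05:
    print('reject the null hypothesis')

# the proposed alpha argument would live on the plotting side instead
mantel.plot_correlations(result, alpha=0.05)  # assumed signature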

Ability to change tail after performing the test

This feels a little odd to me because usually one should pick the tail before running the test or looking at the results. Nevertheless, I've added a setter to allow the tail parameter to be changed after the test result has been computed. I'm still unsure about this, though, because it seems like the MantelResult object should be immutable. How does this connect to your desired usage? Would it be acceptable to run the test separately for each tail (even if this means redundant computation)?
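For reference, the redundant-computation alternative is a one-liner per tail (tail is an existing parameter of test(); X and Y as in the usage example above):

# run the test once per tail instead of mutating the result afterwards
result_upper = mantel.test(X, Y, tail='upper')
result_lower = mantel.test(X, Y, tail='lower')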

General refactor of e.g. parameter validation code and statistical calculation functions

Some of this is possibly a good idea, but much of it may become redundant if we go for the more minimal refactor I have proposed above.

Mancheron commented 2 years ago

Hi,

First of all, many thanks for spending your time on my proposal.

I should have specified that I'm not a Python programmer at all, so please don't read any subliminal message into my code.

About "Addition of type hinting" & "Object oriented approach": I added type hinting for my convenience (and an object oriented programming approach for the sake of simplicity). That said, there is no plea for more code modernization.

About the "Ability to change tail after performing the test": I agree with you that having a result class with a tail setter looks very weird but it doesn't matter either (another way is to consider a MantelResult object as immutable, but to provide a copy factory with the possibility to change the tail). I'm not sure about the benefit of such approach. Having the possibility to "mutate" the MantelResult doesn't bother me in the least.

About the "Specification of alpha level and evaluation of statistical significance": Actually, in my first attempt to adapt your code, I didn't integrate this parameter since, as you said, it's easy to check the statistical significance. Nevertheless, I realized that the value to compare relies on the tail method, this is why I chose to add it to the properties (as well as a way to retrieve the significance interval). If you finally intend to re-integrate this parameter, I may suggest keeping the name significance_level instead of alpha (even if this last is shorter). Actually, I initially called this parameter "alpha", but since it may be confusing with the alpha channel parameter of figures, I renamed it as 'significance_level' (which seems to be the official nomenclature for this parameter).

Up to this point, you will notice a general agreement with your point of view (things change in what follows).

About the "Plotting function", my objective is to visually check for the normality of the random correlations, thus to compare the histogram density with the normal distribution. So it is my need/intent (and the main reason why I need to refactor your code). You may disagree, but it is my need. Indeed, I also perform test of normality using [among others] Lilliefors, Jarque-Bera and Omnibus, but having this visual superposition is a plus for me.

To be very honest, my first version (before my first pull request) fitted all my needs: performing a one-shot visual analysis of some distance matrices, compared against a reference matrix, to observe the plasticity of some given parameters. However, I'm convinced that rms was absolutely right when he said that authors should be informed when their software/libraries are modified and released under a free license, which is the case here (at least out of politeness). So I took a little time to write a piece of code that preserves compatibility with your API and allows me to produce my graphics. On the basis of your feedback I then proposed a cleaner version, but it is your library, and you are free to take any piece of code into account or not. You may even reject all of the propositions; I will not be sad or disappointed. My version allows me to continue my analysis, and in a week or two I will probably never come back to these analyses, so feel free to take whatever you want and whatever you think is an improvement (that's the principle of the MIT license) and throw the rest in the trash.

To summarize, I really thank you for your code and your suggestions. You don't have to spend time convincing me of anything. It is your library and your repository, and you decide whether or not to integrate any given feature (and if I'm unhappy, I can still use my fork, so everything is fine ^-^).

jwcarr commented 2 years ago

Thanks for the info. I've pulled in your plotting code and added the MantelResult class, which permits a lot more flexibility in terms of accessing the correlations.

Also, it seems like you are trying to accept the null hypothesis, but (in frequentist statistics) you can only reject the null (and thereby accept the alternative). To me, it would make more sense to show the veridical correlation in green if p < 0.05 and red if p > 0.05. Did I understand correctly what you are aiming to do?

Mancheron commented 2 years ago

Thanks for the pull. About the color: I agree with you that one generally performs a Mantel test in order to show that the distance matrices are correlated. However, in my opinion, green is associated with success and red with failure (whatever the test). In the Mantel test, even if the usual goal is to reject the null hypothesis, the fact remains that the test is stated in terms of the null hypothesis. If the test fails, this hypothesis is rejected in favor of the alternative hypothesis, so it feels more intuitive to me to see red in that case. Anyway, since the colors for acceptance/rejection are parameters of the plot_correlations() function, I'm not sure the default setting matters much, since users can set the colors of their choice. By the way, it seems that the release number has not been updated (which may be confusing). Thanks again for your time.

jwcarr commented 2 years ago

Thanks to you too! Your PR prompted me to tidy this package up a bit and add some new useful features. I've just made a few more changes and pushed a new version.