Implementation of Most Similar Neighbor (MSN)

grovduck commented 1 year ago

This PR implements Most Similar Neighbor (MSN) (Moeur and Stage, 1998) which uses Canonical Correlation Analysis to associate X and y matrices. Linear coefficients for both the X and y matrices are calculated such that the calculated fitted scores of each matrix maximize the correlation between matrices.

This implementation generally follows the code presented for method msn in yaImpute. I was unable to directly port the R function cancor into Python. The R implementation uses both QR decomposition and SVD decomposition of matrices to arrive at the X and y coefficients and the QR decomposition can introduce reordering of columns that I was unable to reproduce in Python. I finally stumbled on statsmodels.CanCorr which uses SVD decomposition (exclusively) but throws an error if the matrices do not have full rank after the decomposition. I have borrowed heavily from this implementation and we need to make sure we are adhering to their license. Unfortunately, I wasn't able to figure out how to use sklearn's own CCA to produce the correct output. This may be a place for further follow-up.

Likewise, there is a test (ftest_cor) ported directly from the yai function in yaImpute which (presumably) finds the significant CCorA axes to use based on p-values from an F-test (I think I have this right?). This requires us to include scipy as a requirement, which is already a requirement of scikit-learn.

Moeur, M. and Stage, A.R., 1995. Most similar neighbor: an improved sampling inference procedure for natural resource planning. Forest science, 41(2), pp.337-359.

grovduck commented 1 year ago

Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.

No bummer at all - this is really important to address and I agree that we need to figure this out. I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

You might be able to answer this question, though. When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then. Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

aazuspan commented 1 year ago

I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

Good point! It looks like this will depend on which files we ported from then, since they seem to give a very permissive license for anything outside of the ANN code:

=====All other files including src/rfoneprox.cpp ==
All files in this package, other than those refered to above, were prepared by 
Nicholas L. Crookston, a U.S. Government employee working on official time 
and are in the public domain.

So it may be worth noting in docstrings which files were ported, since that will determine the appropriate license.

When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

No, at least in the case of BSD-3 and MIT there are no restrictions on usage of code, only on redistribution or modification. I think GPL might be a little trickier in this area, and it seems like it's open to interpretation around whether GPL allows "dynamic linking" and whether importing in Python even counts as "dynamic linking".

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

Yes, squishy is definitely my impression too! I think the truth is that very few projects follow all the rules correctly, and there's virtually zero chance that anyone notices or cares. We're probably going above and beyond with this level of effort.

grovduck commented 1 year ago

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

@aazuspan, can you check on the markers I left? Two issues:

I used REFERENCE as a marker to be able to refer to these at a later point
I used permalinks for derived lines of codes which exceed line length so I left #noqa: E501s at the ends of these lines

Neither of these decisions am I wedded to, so please suggest alternatives if you'd like.

grovduck commented 1 year ago

... if these are just placeholders then I'm fine with it as is.

Yes, I think these are just placeholders and will eventually be removed. I'm going to merge this one to keep things going, but I've left a note in #29 to remind us to deal with these.

lemma-osu / sknnr

Implementation of Most Similar Neighbor (MSN) #27