Closed grovduck closed 1 year ago
Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.
No bummer at all - this is really important to address and I agree that we need to figure this out. I just took a bit of a deep dive myself into yaImpute
licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.
You might be able to answer this question, though. When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn
uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?
I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then. Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.
I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.
This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.
I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.
Good point! It looks like this will depend on which files we ported from then, since they seem to give a very permissive license for anything outside of the ANN code:
=====All other files including src/rfoneprox.cpp ==
All files in this package, other than those refered to above, were prepared by
Nicholas L. Crookston, a U.S. Government employee working on official time
and are in the public domain.
So it may be worth noting in docstrings which files were ported, since that will determine the appropriate license.
When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?
No, at least in the case of BSD-3 and MIT there are no restrictions on usage of code, only on redistribution or modification. I think GPL might be a little trickier in this area, and it seems like it's open to interpretation around whether GPL allows "dynamic linking" and whether importing in Python even counts as "dynamic linking".
I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.
Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!
This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.
Yes, squishy is definitely my impression too! I think the truth is that very few projects follow all the rules correctly, and there's virtually zero chance that anyone notices or cares. We're probably going above and beyond with this level of effort.
Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!
@aazuspan, can you check on the markers I left? Two issues:
REFERENCE
as a marker to be able to refer to these at a later point#noqa: E501
s at the ends of these linesNeither of these decisions am I wedded to, so please suggest alternatives if you'd like.
This PR implements Most Similar Neighbor (MSN) (Moeur and Stage, 1998) which uses Canonical Correlation Analysis to associate
X
andy
matrices. Linear coefficients for both theX
andy
matrices are calculated such that the calculated fitted scores of each matrix maximize the correlation between matrices.This implementation generally follows the code presented for method
msn
inyaImpute
. I was unable to directly port the R functioncancor
into Python. The R implementation uses both QR decomposition and SVD decomposition of matrices to arrive at theX
andy
coefficients and the QR decomposition can introduce reordering of columns that I was unable to reproduce in Python. I finally stumbled onstatsmodels.CanCorr
which uses SVD decomposition (exclusively) but throws an error if the matrices do not have full rank after the decomposition. I have borrowed heavily from this implementation and we need to make sure we are adhering to their license. Unfortunately, I wasn't able to figure out how to use sklearn's ownCCA
to produce the correct output. This may be a place for further follow-up.Likewise, there is a test (
ftest_cor
) ported directly from theyai
function inyaImpute
which (presumably) finds the significant CCorA axes to use based on p-values from an F-test (I think I have this right?). This requires us to includescipy
as a requirement, which is already a requirement ofscikit-learn
.