@aazuspan, I think this is still pretty rough, but I wanted to get your feedback on the design and specifically the tests. I'll add some specific comments inline with the code, but here are the general points:
There is a new mixin (ComponentReducerMixin) that handles the dimension reduction. I followed the naming of sklearn's PCA and use n_components to describe the number of axes to retain. Note that I'm much more used to using the term "axes" instead of "components", but went with the latter to remain consistent. This mixin doesn't do much, but I've introduced it to reduce code duplication for both CCATransformer and CCorATransformer.
I've tweaked the CCA and CCorA classes to make them more consistent, specifically adding a max_components property and a projector method that can take n_components. This takes some of the logic out of their respective transformers.
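For reference, here's the rough shape I'm aiming for (a minimal sketch only: the actual ordination math is elided and the coefficient storage is simplified, so the attribute names other than max_components, projector, and n_components shouldn't be read as the real ones):

```python
import numpy as np


class CCorA:
    """Sketch of the ordination API shape, not the actual implementation."""

    def __init__(self, X, Y):
        # The real fit is elided; pretend it produces one column of axis
        # coefficients per candidate axis (n_features x n_axes).
        self.coefficients_ = np.eye(np.asarray(X).shape[1])

    @property
    def max_components(self):
        # Upper bound on usable axes (limited by rank and, for CCorA,
        # by the axis significance tests).
        return self.coefficients_.shape[1]

    def projector(self, n_components=None):
        # Matrix that projects X onto the first n_components axes, so a
        # transformer only needs X @ self.projector(n_components).
        k = self.max_components if n_components is None else n_components
        return self.coefficients_[:, :k]
```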
The tests are pretty janky. I've got two tests in there to make sure that a transformed X is reduced in dimensionality correctly and that it throws an error when the user-specified value is out of range. But I don't like the parameterization that I have for n_components in either test. For the first, I've used [None, 0, 5] because all of those are valid values for this specific test data. However, because of axis significance tests in CCorA, it already reduces the number of axes from 28 to 5. Do you think we should first retrieve this value (e.g. CCorA(X, y).max_components) in the test before running it? How could you parameterize such an expression? For the second, I'm just testing a negative value and an absurdly high value, which both fall out of range, but I'm not psyched about the constant values again.
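Something like the following is what I'm imagining (a rough sketch only; the dataset loading and import paths are hand-waved stand-ins, not the package's actual API), but I don't know whether it's idiomatic pytest:

```python
import pytest

# Hypothetical module-scope setup: load the same test data the tests use and
# ask CCorA for its data-dependent upper bound (5 for this dataset).
X, y = load_moscow_stjoes_data()  # placeholder loader name
MAX_COMPONENTS = CCorA(X, y).max_components


@pytest.mark.parametrize("n_components", [None, 0, MAX_COMPONENTS])
def test_ccora_transform_n_components(n_components):
    # The actual assertions on the transformed X would go here.
    ...
```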
I haven't yet created the data to test the estimators with reduced transformers, but I'd like to include those as part of this PR. These would be porting tests again, but I'll need to do this outside of yaImpute.
As always, a great review catching lots of little errors. Still need to work on estimator tests, but I think I've addressed most of your comments.
Great, it makes sense to me to include that in this PR too. I assume that means the ability to specify n_components when instantiating estimators would happen here, too?
@aazuspan, I've got the bones of this for GNNRegressor ready to go - both test data and the required additions to the estimator. Before I dump a bunch of new (reduced) test files in, I thought we should discuss what makes sense from a testing perspective. I've run GNN with n=8 components (using pynnmap) and have those test files ready:
gnn_moscow_ref_distances_k5_c8.csv
gnn_moscow_ref_neighbors_k5_c8.csv
Do we want to change code to accommodate these new files? Is this a one-off for testing or will we want to keep these data files around? I like the "reassurance" that we're getting the correct values, but I don't know how long we want to retain all the porting info. Do you have any opinions on this?
> Do we want to change code to accommodate these new files? Is this a one-off for testing or will we want to keep these data files around? I like the "reassurance" that we're getting the correct values, but I don't know how long we want to retain all the porting info. Do you have any opinions on this?
This is a great question. I originally thought the port tests would be temporary, but I agree with you that it's nice to know that everything is staying accurate, as well as functional, so I'm fully in favor of keeping them around in some form or another.
Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr, although I'm not sure exactly how that would happen in practice. EDIT: pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results...
For now, I guess I would lean towards integrating n_components into our testing system. Maybe it could just be another argument to tests.datasets.load_moscow_stjoes_results that gets appended (if it's not None) to the file names?
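Something like this is what I'm picturing (purely illustrative; the real loader's signature and file layout may differ, and the helper name here is made up):

```python
def _result_file(method: str, kind: str, k: int = 5, n_components=None) -> str:
    """Build the name of a ported-results CSV (illustrative helper only)."""
    # Full run:    gnn_moscow_ref_distances_k5.csv
    # Reduced run: gnn_moscow_ref_distances_k5_c8.csv
    suffix = "" if n_components is None else f"_c{n_components}"
    return f"{method}_moscow_{kind}_k{k}{suffix}.csv"


assert _result_file("gnn", "ref_distances", n_components=8) == "gnn_moscow_ref_distances_k5_c8.csv"
```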
> For now, I guess I would lean towards integrating n_components into our testing system. Maybe it could just be another argument to tests.datasets.load_moscow_stjoes_results that gets appended (if it's not None) to the file names?
OK, I've got a straw man for testing in there using this suggestion. At present, I'm only testing against a 3-component version of both GNNRegressor and MSNRegressor. The test files for GNNRegressor were created "independently" through pynnmap, so I'm fairly confident that those tests are working as intended. For MSNRegressor, I "cheated" by actually running sknnr with reduced components to get the neighbor, distance, and predicted files, so there is no mystery why those tests are passing. I felt a little dirty about doing it this way, but it was going to take a while to do through yaImpute and I don't have it set up in pynnmap. And it will be good for future tests as well. I'm 99.9% sure that the test files I produced really are the correct answer (famous last words ...).
A couple of other things to note:
GNNRegressor and MSNRegressor are looking way too similar to not be refactored into a common base class (at some point). We could easily abstract out the transformer in the base class and specify it in the derived classes; probably good to tackle when I do the CCA and CCorA reworking. (I've sketched roughly what I mean at the end of this comment.)

In test_port.py, the functions test_kneighbors_reduced_components and test_predict_reduced_components have n_components = 3 hard-wired in there. Should we parametrize this even if we don't have other n_components to test?

Sorry this PR is turning into such a beast - I appreciate your help!
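Here's roughly what I'm imagining for that shared base class (a hypothetical sketch only: the class names, import path, and attribute names below are placeholders, and the actual fit/kneighbors plumbing is hand-waved):

```python
from sklearn.base import BaseEstimator, RegressorMixin

# Assumed import path, for illustration only.
from sknnr.transformers import CCATransformer, CCorATransformer


class _OrdinationKNNRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical shared base: holds the common hyperparameters and builds
    the transformer; subclasses only declare which transformer to use."""

    transformer_cls = None  # set by each subclass

    def __init__(self, n_neighbors=5, n_components=None):
        self.n_neighbors = n_neighbors
        self.n_components = n_components

    def _make_transformer(self):
        return self.transformer_cls(n_components=self.n_components)


class GNNRegressor(_OrdinationKNNRegressor):
    transformer_cls = CCATransformer


class MSNRegressor(_OrdinationKNNRegressor):
    transformer_cls = CCorATransformer
```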
> EDIT: pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results...
Oooo, super cool. I'd want to play around with these packages to fully understand them, but they seem totally applicable for our use. Nice find!
At first glance I was hoping we could parameterize n_components as [None, 3] and run it with all estimators to avoid duplicating those tests, but I'm guessing you probably already thought of that. I poked around a little while and couldn't come up with any good solution (apparently there's no way to silently skip a test in pytest), so I'd say go ahead and merge if you're ready!
Again, you give me way too much credit. I hadn't thought of removing that duplication, but it makes sense. You're way ahead of me here, but is this what you were thinking?
```python
@pytest.mark.parametrize(
    "result", ESTIMATOR_RESULTS.items(), ids=ESTIMATOR_RESULTS.keys()
)
@pytest.mark.parametrize("n_components", [None, 3], ids=["full", "reduced"])
def test_kneighbors(result, n_components):
    """Test that the ported estimators identify the correct neighbors and distances."""
    method, estimator = result
    if method not in ORDINATION_ESTIMATOR_RESULTS.keys() and n_components is not None:
        pytest.skip("These methods do not support variable n_components.")
    dataset = load_moscow_stjoes_results(method=method, n_components=n_components)
    hyperparams = (
        {"n_neighbors": 5}
        if method not in ORDINATION_ESTIMATOR_RESULTS.keys()
        else {"n_neighbors": 5, "n_components": n_components}
    )
    est = estimator(**hyperparams).fit(dataset.X_train, dataset.y_train)
    dist, nn = est.kneighbors(return_dataframe_index=True)
    assert_array_equal(nn, dataset.ref_neighbors)
    assert_array_almost_equal(dist, dataset.ref_distances, decimal=3)
    dist, nn = est.kneighbors(dataset.X_test, return_dataframe_index=True)
    assert_array_equal(nn, dataset.trg_neighbors)
    assert_array_almost_equal(dist, dataset.trg_distances, decimal=3)
```
It's a little verbose and has too much logic, but I'm happy to change to this if you like this form better. It's not silently skipping (as you say), but is that OK? I read through that thread you sent, and it looks like the optimal solution would be to filter out the combinations that don't make sense (e.g. ("raw", 3)) as part of collection rather than in the test itself.
Yes, that's almost exactly the solution I came up with. It feels odd to me to have tests that will always be skipped, but maybe that's not a big deal? It is nice not to duplicate the logic of those two tests.
> it looks like the optimal solution would be to filter out the combinations that don't make sense (e.g. ("raw", 3)) as part of collection rather than in the test itself.
Correct me if I'm wrong, but my understanding is that this would require combining the two parameterized args (result and n_components) into a single arg that contains all valid combinations of result and n_components, which seems to me like it would hurt readability and flexibility. Or the other alternative seemed to be monkey patching some new functionality into pytest, which I also don't love.
Do you have any preference? I think I lean towards either leaving it as-is with duplicated tests or conditionally skipping the invalid tests, and I could live with either.
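For concreteness, the "single combined arg" version I was picturing would look roughly like this (a sketch only, reusing the names from your snippet above), which is where I think the readability suffers:

```python
import pytest

# Build only the valid (result, n_components) pairs up front: every estimator
# gets the full run, and only the ordination estimators get the reduced run.
VALID_COMBINATIONS = [
    *[(result, None) for result in ESTIMATOR_RESULTS.items()],
    *[(result, 3) for result in ORDINATION_ESTIMATOR_RESULTS.items()],
]


@pytest.mark.parametrize("result,n_components", VALID_COMBINATIONS)
def test_kneighbors(result, n_components):
    ...
```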
> Correct me if I'm wrong, but my understanding is that this would require combining the two parameterized args (result and n_components) into a single arg that contains all valid combinations of result and n_components, which seems to me like it would hurt readability and flexibility. Or the other alternative seemed to be monkey patching some new functionality into pytest, which I also don't love.
In the thread you sent, I think this post was the closest I saw to a suggested solution. I'm playing around with it a bit - it looks like you use the pytest_collection_modifyitems hook to filter out the combinations you don't want. Would you consider that too magic?
If that's too much, I think I'll probably keep the duplicated tests in there, just because it sounds like the consensus is that skip is meant to denote a test that may pass at some point.
Here's a worked example, copying most of what was in that post:
conftest.py
------------
```python
from __future__ import annotations  # needed for list[pytest.Item] on Python 3.8

import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "uncollect_if(*, func): function to unselect tests from parametrization",
    )


def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]):
    removed = []
    kept = []
    for item in items:
        m = item.get_closest_marker("uncollect_if")
        if m:
            func = m.kwargs["func"]
            if func(**item.callspec.params):
                removed.append(item)
                continue
        kept.append(item)
    if removed:
        config.hook.pytest_deselected(items=removed)
        items[:] = kept
```
test_port.py
------------
```python
def uncollect_if(result, n_components):
    method, _ = result
    if n_components is not None and method not in ORDINATION_ESTIMATOR_RESULTS.keys():
        return True


@pytest.mark.uncollect_if(func=uncollect_if)
@pytest.mark.parametrize(
    "result", ESTIMATOR_RESULTS.items(), ids=ESTIMATOR_RESULTS.keys()
)
@pytest.mark.parametrize("n_components", [None, 3], ids=["full", "reduced"])
def test_kneighbors(result, n_components):
    """Test that the ported estimators identify the correct neighbors and distances."""
    method, estimator = result
    dataset = load_moscow_stjoes_results(method=method, n_components=n_components)
    hyperparams = (
        {"n_neighbors": 5}
        if method not in ORDINATION_ESTIMATOR_RESULTS.keys()
        else {"n_neighbors": 5, "n_components": n_components}
    )
    est = estimator(**hyperparams).fit(dataset.X_train, dataset.y_train)
    dist, nn = est.kneighbors(return_dataframe_index=True)
    assert_array_equal(nn, dataset.ref_neighbors)
    assert_array_almost_equal(dist, dataset.ref_distances, decimal=3)
    dist, nn = est.kneighbors(dataset.X_test, return_dataframe_index=True)
    assert_array_equal(nn, dataset.trg_neighbors)
    assert_array_almost_equal(dist, dataset.trg_distances, decimal=3)
```
with the corresponding output:
```
PS D:\code\scikit-learn-knn-regression> hatch run test:all tests/test_port.py::test_kneighbors -v
============================================================================= test session starts =============================================================================
platform win32 -- Python 3.10.1, pytest-7.3.1, pluggy-1.0.0 -- C:\Users\gregorma\AppData\Local\hatch\env\virtual\scikit-learn-knn-regression\47ek3DpI\test\Scripts\python.EXE
cachedir: .pytest_cache
rootdir: D:\code\scikit-learn-knn-regression
configfile: pyproject.toml
plugins: cov-4.0.0
collected 10 items / 3 deselected / 7 selected

tests/test_port.py::test_kneighbors[full-raw] PASSED                  [ 14%]
tests/test_port.py::test_kneighbors[full-euclidean] PASSED            [ 28%]
tests/test_port.py::test_kneighbors[full-mahalanobis] PASSED          [ 42%]
tests/test_port.py::test_kneighbors[full-gnn] PASSED                  [ 57%]
tests/test_port.py::test_kneighbors[full-msn] PASSED                  [ 71%]
tests/test_port.py::test_kneighbors[reduced-gnn] PASSED               [ 85%]
tests/test_port.py::test_kneighbors[reduced-msn] PASSED               [100%]

======================================================================= 7 passed, 3 deselected in 4.03s =======================================================================
```
~Ah, I glossed over that post because I was a bit confused by it, to be honest. I'm not very familiar with the pytest hook system, but it does look like it's probably the cleanest solution, overall. It does feel slightly magic, but it also seems like it's idiomatic for pytest.~

~If you want to adapt that for our use case, I'm happy with that solution!~
EDIT: Didn't notice your last post!
That looks great, thanks for implementing it! I don't understand exactly how the pytest_collection_modifyitems hook works, but I think I get the gist of it. Tests reporting as deselected rather than skipped makes more sense to me, and it's nice that they don't show up in the verbose output like the skipped tests did. I think this is a good strategy, and just have a few code suggestions for you to consider below:
- Probably add a docstring to pytest_collection_modifyitems to explain what it's doing and link to the Github post.
- In conftest.py, can we move the marker configuration into pyproject.toml? Defining the config programmatically feels a little odd, outside of a plugin approach where it needs to be patched in after-the-fact.
- In test_port.py, can we rename uncollect_if to something specific, like estimator_does_not_support_ncomponents (or something shorter)? It would be even better if it could be inlined, but I imagine that would be too long to be easily readable?
- In the same function above, what do you think about replacing the method not in ORDINATION_ESTIMATOR_RESULTS.keys() check with not hasattr(method, "n_components")? I think that accomplishes the same thing and would allow us to get rid of ORDINATION_ESTIMATOR_RESULTS, but I don't have a strong feeling so let me know if there's a downside I'm not considering.

Arrgh, I'm losing steam on this PR ... A couple of blockers/annoyances:
I tried using the new dictionary union operator (|), but it fails on 3.8. Then I tried this syntax instead: d = dict(a=1); d.update({"b": 2}), but now sourcery complains on pre-commit that I should be using the union operator. I think I'm almost ready to ditch sourcery as a pre-commit hook because it doesn't look like you can specify options via pyproject.toml and, because I'm using Python 3.10 in development, it always wants me to incorporate the new features (same thing goes for the walrus operator in one case).
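(For what it's worth, a spelling that works back to Python 3.5 and builds the same dict, sketched below with the hyperparameter dict from the tests, is the double-star unpacking merge or the dict(base, key=value) copy-and-extend form; neither needs the 3.9-only | operator, though that doesn't fix the sourcery configuration problem.)

```python
# Two 3.8-compatible ways to build {"n_neighbors": 5, "n_components": 3}
# without the Python 3.9+ dict union operator:
base = {"n_neighbors": 5}

merged = {**base, "n_components": 3}      # unpacking merge (Python 3.5+)
merged_alt = dict(base, n_components=3)   # copy-and-extend via dict()
```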
We can filter out the impossible combinations of (method, n_components), but in the test we also need to change the hyperparameters based on the estimator type as well. This means we're writing the same check (i.e. hasattr(estimator(), "n_components")) both in the new estimator_does_not_support_n_components function and in the test itself.
In order to use estimator_does_not_support_n_components in both test_kneighbors and test_predict, the function signature has to accommodate result and n_components from test_kneighbors, but also the weighted parameter from test_predict. For now, the function signature looks like def estimator_does_not_support_n_components(result, n_components, **kwargs), but I'm not sure that's the right approach.
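For concreteness, the current shape of that shared filter is roughly the following (a sketch of what I described above, not necessarily the final code):

```python
def estimator_does_not_support_n_components(result, n_components, **kwargs):
    """Return True for parametrized combinations that should be deselected.

    **kwargs swallows parameters that only some tests have (e.g. `weighted`
    from test_predict) so the same filter works for both tests.
    """
    _, estimator = result
    return n_components is not None and not hasattr(estimator(), "n_components")
```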
I could use your feedback on these issues. The first is a blocker for actually pushing the code and it might be easier to provide feedback on the second two after I commit/push my new changes.
> I tried using the new dictionary union operator (|), but it fails on 3.8. Then I tried this syntax instead: d = dict(a=1); d.update({"b": 2}), but now sourcery complains on pre-commit that I should be using the union operator. I think I'm almost ready to ditch sourcery as a pre-commit hook because it doesn't look like you can specify options via pyproject.toml and, because I'm using Python 3.10 in development, it always wants me to incorporate the new features (same thing goes for the walrus operator in one case).
Wow, dict unions are new to me! I feel like I always miss out on the cool new features by maintaining backwards compatible packages...
Back to the issue at hand, I'm open to dropping sourcery from our pre-commit. I think I've accepted a few corrections from it that I was ambivalent about, but it's also pretty slow to run, and the fact that any new contributor would have to sign up for an account to run our checks is a little off-putting. With the lack of configuration and the Python version issues, I'd say it's doing more harm than good.
As far as how we remove it, I'm totally open to you just disabling it in this PR and then cleaning up the rest of the pre-commit stuff in #41. That's probably not strictly the cleanest way to do it, but I think the alternative of doing #41 first would end up blocking this for a while and probably create a headache when you have to merge in all the changes that will stem from running the pyupgrade rules. Unless it feels too janky to you to combine this PR with a change to the pre-commit, that's the lazy approach I would suggest.
As recommended, I'll take a closer look at the other questions once we get to a resolution here!
> As far as how we remove it, I'm totally open to you just disabling it in this PR and then cleaning up the rest of the pre-commit stuff in #41. That's probably not strictly the cleanest way to do it, but I think the alternative of doing #41 first would end up blocking this for a while and probably create a headache when you have to merge in all the changes that will stem from running the pyupgrade rules. Unless it feels too janky to you to combine this PR with a change to the pre-commit, that's the lazy approach I would suggest.
Excellent, all sourcery code is currently commented out. With #41, I'll leave the removal to you (we'll probably want to remove the key as well?). black did some spacing formatting on .pre-commit-config.yaml, so hopefully that jibes with how it formats on your end.
To your earlier code suggestions:
> - Probably add a docstring to pytest_collection_modifyitems to explain what it's doing and link to the Github post.
I've done this and added a link to the specific comment in the pytest issue.
> - In conftest.py, can we move the marker configuration into pyproject.toml? Defining the config programmatically feels a little odd, outside of a plugin approach where it needs to be patched in after-the-fact.
Cool, I didn't know you could do this here, but I think I've done it correctly.
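For reference, the registration in pyproject.toml should end up looking roughly like this (the description string is just what was in the conftest.py snippet above):

```toml
[tool.pytest.ini_options]
markers = [
    "uncollect_if(*, func): function to unselect tests from parametrization",
]
```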
> - In test_port.py, can we rename uncollect_if to something specific, like estimator_does_not_support_ncomponents (or something shorter)? It would be even better if it could be inlined, but I imagine that would be too long to be easily readable?
I actually liked your naming suggestion for the function, so I kept it. I didn't inline it because we can use it for both tests (I think you meant inlining the function within the test itself, right?).
> - In the same function above, what do you think about replacing the method not in ORDINATION_ESTIMATOR_RESULTS.keys() check with not hasattr(method, "n_components")? I think that accomplishes the same thing and would allow us to get rid of ORDINATION_ESTIMATOR_RESULTS, but I don't have a strong feeling so let me know if there's a downside I'm not considering.
I've done this as well, and now ORDINATION_ESTIMATOR_RESULTS is no longer needed.
Hopefully, the two questions/concerns I raised earlier will now make sense. Thanks for keeping with me on this one - it's turned out to be a bit epic!
This feature introduces a new mixin (ComponentReducerMixin) to handle dimension reduction for transformers which create "axes" as linear combinations of their X features (presently CCATransformer and CCorATransformer). For these transformers, there is an upper limit on the number of axes that can be used, either by matrix rank or by using tests to determine the significance of axes. The user can specify a hyperparameter (n_components) on these transformers to use fewer than the maximum number of axes, which in turn changes the n-dimensional space for NN finding.

A few relevant changes:
- CCA and CCorA ordination classes now have a max_components property to ensure consistency, so that either can be used in conjunction with ComponentReducerMixin.
- CCA has been refactored to combine the coefficients and axis_weights into a single matrix, which has been moved into the projector method (similar to CCorA).
- CCA and CCorA's projector method now accommodates a parameter for n_components.
- Tests ensure that transformed X data has the correct number of components and that errors are raised if the user-specified value is outside the range [0, max_components].
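A rough sketch of the mixin's role (simplified for illustration; the method name _validate_n_components and the exact error message are placeholders, not the actual implementation):

```python
class ComponentReducerMixin:
    """Shared n_components handling for CCATransformer and CCorATransformer
    (simplified sketch, not the exact implementation)."""

    def __init__(self, n_components=None):
        self.n_components = n_components

    def _validate_n_components(self, max_components):
        # None means "use every available axis"; otherwise the requested
        # value must fall within [0, max_components].
        if self.n_components is None:
            return max_components
        if not 0 <= self.n_components <= max_components:
            raise ValueError(
                f"n_components={self.n_components} must be in the range "
                f"[0, {max_components}]."
            )
        return self.n_components
```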