lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Allow for user to specify reduced number of components (axes) #33 #40

Closed · grovduck closed 1 year ago

grovduck commented 1 year ago

This feature introduces a new mixin (ComponentReducerMixin) to handle dimension reduction for transformers that create "axes" as linear combinations of their X features (presently CCATransformer and CCorATransformer). For these transformers, there is an upper limit on the number of axes that can be used, determined either by matrix rank or by significance tests on the axes. The user can specify a hyperparameter (n_components) on these transformers to use fewer than the maximum number of axes, which in turn changes the dimensionality of the space used for NN finding.
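
For example, here's a usage sketch (the import path and data variables are illustrative placeholders, not part of this PR's diff):

from sknnr import GNNRegressor  # top-level import path assumed

# X_train, y_train: environmental features and species matrix for the reference
# plots (placeholders for user-supplied data).
est = GNNRegressor(n_neighbors=5, n_components=3).fit(X_train, y_train)

# Neighbors are now found in a reduced 3-axis space instead of the full set of axes.
dist, nn = est.kneighbors(return_dataframe_index=True)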

A few relevant changes:

grovduck commented 1 year ago

@aazuspan, I think this is still pretty rough, but I wanted to get some feedback from you on the design and specifically the tests. I'll add some specific comments inline with the code, but here are the general points:

grovduck commented 1 year ago

As always, a great review catching lots of little errors. Still need to work on estimator tests, but I think I've addressed most of your comments.

grovduck commented 1 year ago

Great, it makes sense to me to include that in this PR too. I assume that means the ability to specify n_components when instantiating estimators would happen here, too?

@aazuspan, I've got the bones of this for GNNRegressor ready to go - both test data and the required additions to the estimator. Before I dump a bunch of new (reduced) test files in, I thought we should discuss what makes sense from a testing perspective. I've run GNN with n=8 components (using pynnmap) and have those test files ready:

gnn_moscow_ref_distances_k5_c8.csv
gnn_moscow_ref_neighbors_k5_c8.csv

Do we want to change the code to accommodate these new files? Is this a one-off for testing, or will we want to keep these data files around? I like the "reassurance" that we're getting the correct values, but I don't know how long we want to retain all the porting info. Do you have any opinions on this?

aazuspan commented 1 year ago

Do we want to change the code to accommodate these new files? Is this a one-off for testing, or will we want to keep these data files around? I like the "reassurance" that we're getting the correct values, but I don't know how long we want to retain all the porting info. Do you have any opinions on this?

This is a great question. I originally thought the port tests would be temporary, but I agree with you that it's nice to know that everything is staying accurate, as well as functional, so I'm fully in favor of keeping them around in some form or another.

Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr, although I'm not sure exactly how that would happen in practice. EDIT: pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results...

For now, I guess I would lean towards integrating n_components into our testing system. Maybe it could just be another argument to tests.datasets.load_moscow_stjoes_results that gets appended (if it's not None) to the file names?
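
Something like this rough sketch (the loader internals here are hypothetical; the naming just follows the files above):

def load_moscow_stjoes_results(method, k=5, n_components=None):
    """Load ported neighbor/distance/prediction files for the given method.

    Sketch only: when n_components is not None, a `_c{n}` suffix selects the
    reduced-component files, e.g. gnn_moscow_ref_distances_k5_c8.csv.
    """
    suffix = f"_c{n_components}" if n_components is not None else ""
    ref_neighbors_file = f"{method}_moscow_ref_neighbors_k{k}{suffix}.csv"
    ref_distances_file = f"{method}_moscow_ref_distances_k{k}{suffix}.csv"
    ...  # read the files and build the results object as before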

grovduck commented 1 year ago

For now, I guess I would lean towards integrating n_components into our testing system. Maybe it could just be another argument to tests.datasets.load_moscow_stjoes_results that gets appended (if it's not None) to the file names?

OK, I've got a straw man for testing in there using this suggestion. At present, I'm only testing against a 3-component version of both GNNRegressor and MSNRegressor. The test files for GNNRegressor were created "independently" through pynnmap, so I'm fairly confident that those tests are working as intended. For MSNRegressor, I "cheated" by actually running sknnr with reduced components to get the neighbor, distance, and predicted files, so there is no mystery why those tests are passing. I felt a little dirty about doing it this way, but it was going to take a while to do through yaImpute and I don't have it set up in pynnmap. And it will be good for future tests as well. I'm 99.9% sure that the test files I produced really are the correct answer (famous last words ...)

A couple of other things to note:

  1. GNNRegressor and MSNRegressor are looking way too similar not to be refactored into a common base class (at some point). We could easily abstract out the transformer in the base class and specify it in the derived classes (see the rough sketch after this list). Probably good to tackle when I do the CCA and CCorA reworking.
  2. In test_port.py, functions test_kneighbors_reduced_components and test_predict_reduced_components, I have n_components = 3 hard-wired in there. Should we parametrize this even if we don't have other n_components to test?
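
For what it's worth, here's the kind of refactor I'm picturing. This is a rough sketch only, with a hypothetical base-class name and an assumed transformer import path, not the current sknnr code:

from sklearn.neighbors import KNeighborsRegressor

from sknnr.transformers import CCATransformer, CCorATransformer  # import path assumed


class TransformedKNNRegressor(KNeighborsRegressor):
    """Hypothetical base class: derived classes only declare their transformer."""

    def __init__(self, n_neighbors=5, n_components=None):
        super().__init__(n_neighbors=n_neighbors)
        self.n_components = n_components

    def _get_transformer(self):
        raise NotImplementedError

    def fit(self, X, y):
        # Fit the ordination transformer, then find neighbors in the transformed space.
        self.transformer_ = self._get_transformer().fit(X, y)
        return super().fit(self.transformer_.transform(X), y)

    def predict(self, X):
        # kneighbors would transform X the same way before delegating to the parent.
        return super().predict(self.transformer_.transform(X))


class GNNRegressor(TransformedKNNRegressor):
    def _get_transformer(self):
        return CCATransformer(n_components=self.n_components)


class MSNRegressor(TransformedKNNRegressor):
    def _get_transformer(self):
        return CCorATransformer(n_components=self.n_components)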

Sorry this PR is turning into such a beast - I appreciate your help!

grovduck commented 1 year ago

EDIT: pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results...

Oooo, super cool. I'd want to play around with these packages to fully understand them, but they seem totally applicable for our use. Nice find!

grovduck commented 1 year ago

At first glance I was hoping we could parameterize n_components as [None, 3] and run it with all estimators to avoid duplicating those tests, but I'm guessing you probably already thought of that. I poked around a little while and couldn't come up with any good solution (apparently there's no way to silently skip a test in pytest), so I'd say go ahead and merge if you're ready!

Again, you give me way too much credit. I hadn't thought of removing that duplication, but it makes sense. You're way ahead of me here, but is this what you were thinking?

@pytest.mark.parametrize(
    "result", ESTIMATOR_RESULTS.items(), ids=ESTIMATOR_RESULTS.keys()
)
@pytest.mark.parametrize("n_components", [None, 3], ids=["full", "reduced"])
def test_kneighbors(result, n_components):
    """Test that the ported estimators identify the correct neighbors and distances."""
    method, estimator = result

    if method not in ORDINATION_ESTIMATOR_RESULTS.keys() and n_components is not None:
        pytest.skip("These methods do not support variable n_components.")
    dataset = load_moscow_stjoes_results(method=method, n_components=n_components)

    hyperparams = (
        {"n_neighbors": 5}
        if method not in ORDINATION_ESTIMATOR_RESULTS.keys()
        else {"n_neighbors": 5, "n_components": n_components}
    )

    est = estimator(**hyperparams).fit(
        dataset.X_train, dataset.y_train
    )

    dist, nn = est.kneighbors(return_dataframe_index=True)
    assert_array_equal(nn, dataset.ref_neighbors)
    assert_array_almost_equal(dist, dataset.ref_distances, decimal=3)

    dist, nn = est.kneighbors(dataset.X_test, return_dataframe_index=True)
    assert_array_equal(nn, dataset.trg_neighbors)
    assert_array_almost_equal(dist, dataset.trg_distances, decimal=3)

It's a little verbose and has too much logic, but I'm happy to change to this if you like this form better. It's not silently skipping (as you say), but is that OK? I read through that thread you sent and it looks like the optimal solution would be to filter out the combinations that don't make sense (e.g. ("raw", 3)) as part of collection rather than in the test itself.

aazuspan commented 1 year ago

Yes, that's almost exactly the solution I came up with. It feels odd to me to have tests that will always be skipped, but maybe that's not a big deal? It is nice not to duplicate the logic of those two tests.

it looks like the optimal solution would be to filter out the combinations that don't make sense (e.g. ("raw", 3)) as part of collection rather than in the test itself.

Correct me if I'm wrong, but my understanding is that this would require combining the two parameterized args (result and n_components) into a single arg that contains all valid combinations of result and n_components, which seems to me like it would hurt readability and flexibility. Or the other alternative seemed to be monkey patching some new functionality into pytest, which I also don't love.

Do you have any preference? I think I lean towards either leaving it as-is with duplicated tests or conditionally skipping the invalid tests, and I could live with either.

grovduck commented 1 year ago

Correct me if I'm wrong, but my understanding is that this would require combining the two parameterized args (result and n_components) into a single arg that contains all valid combinations of result and n_components, which seems to me like it would hurt readability and flexibility. Or the other alternative seemed to be monkey patching some new functionality into pytest, which I also don't love

In the thread you sent, I think this post was the closest I saw to a suggested solution. I'm playing around with it a bit - it looks like you use the pytest_collection_modifyitems hook to filter out the combinations you don't want. Would you consider that too magic?

If that's too much, I think I'll probably keep the duplicated tests in there, just because it sounds like the consensus is that skip is meant to denote a test that may pass at some point.

grovduck commented 1 year ago

Here's a worked example, copying most of what was in that post:

conftest.py
------------
from __future__ import annotations  # makes the list[pytest.Item] annotation below safe on Python 3.8

import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "uncollect_if(*, func): function to unselect tests from parametrization",
    )

def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]):
    removed = []
    kept = []
    for item in items:
        m = item.get_closest_marker("uncollect_if")
        if m:
            func = m.kwargs["func"]
            if func(**item.callspec.params):
                removed.append(item)
                continue
        kept.append(item)
    if removed:
        config.hook.pytest_deselected(items=removed)
        items[:] = kept

test_port.py
------------
def uncollect_if(result, n_components):
    method, _ = result
    if n_components is not None and method not in ORDINATION_ESTIMATOR_RESULTS.keys():
        return True

@pytest.mark.uncollect_if(func=uncollect_if)
@pytest.mark.parametrize(
    "result", ESTIMATOR_RESULTS.items(), ids=ESTIMATOR_RESULTS.keys()
)
@pytest.mark.parametrize("n_components", [None, 3], ids=["full", "reduced"])
def test_kneighbors(result, n_components):
    """Test that the ported estimators identify the correct neighbors and distances."""
    method, estimator = result

    dataset = load_moscow_stjoes_results(method=method, n_components=n_components)

    hyperparams = (
        {"n_neighbors": 5}
        if method not in ORDINATION_ESTIMATOR_RESULTS.keys()
        else {"n_neighbors": 5, "n_components": n_components}
    )

    est = estimator(**hyperparams).fit(dataset.X_train, dataset.y_train)

    dist, nn = est.kneighbors(return_dataframe_index=True)
    assert_array_equal(nn, dataset.ref_neighbors)
    assert_array_almost_equal(dist, dataset.ref_distances, decimal=3)

    dist, nn = est.kneighbors(dataset.X_test, return_dataframe_index=True)
    assert_array_equal(nn, dataset.trg_neighbors)
    assert_array_almost_equal(dist, dataset.trg_distances, decimal=3)

with the corresponding output:


PS D:\code\scikit-learn-knn-regression> hatch run test:all tests/test_port.py::test_kneighbors -v
============================================================================= test session starts =============================================================================
platform win32 -- Python 3.10.1, pytest-7.3.1, pluggy-1.0.0 -- C:\Users\gregorma\AppData\Local\hatch\env\virtual\scikit-learn-knn-regression\47ek3DpI\test\Scripts\python.EXE
cachedir: .pytest_cache
rootdir: D:\code\scikit-learn-knn-regression
configfile: pyproject.toml
plugins: cov-4.0.0
collected 10 items / 3 deselected / 7 selected

tests/test_port.py::test_kneighbors[full-raw] PASSED                                                                                                                     [ 14%]
tests/test_port.py::test_kneighbors[full-euclidean] PASSED                                                                                                               [ 28%]
tests/test_port.py::test_kneighbors[full-mahalanobis] PASSED                                                                                                             [ 42%]
tests/test_port.py::test_kneighbors[full-gnn] PASSED                                                                                                                     [ 57%]
tests/test_port.py::test_kneighbors[full-msn] PASSED                                                                                                                     [ 71%]
tests/test_port.py::test_kneighbors[reduced-gnn] PASSED                                                                                                                  [ 85%]
tests/test_port.py::test_kneighbors[reduced-msn] PASSED                                                                                                                  [100%]

======================================================================= 7 passed, 3 deselected in 4.03s ======================================================================= 

aazuspan commented 1 year ago

~Ah, I glossed over that post because I was a bit confused by it, to be honest. I'm not very familiar with the pytest hook system, but it does look like it's probably the cleanest solution, overall. It does feel slightly magic, but it also seems like it's idiomatic for pytest.~

~If you want to adapt that for our use case, I'm happy with that solution!~

EDIT: Didn't notice your last post!

That looks great, thanks for implementing it! I don't understand exactly how the pytest_collection_modifyitems hook works, but I think I get the gist of it. Tests reporting as deselected rather than skipped makes more sense to me, and it's nice that they don't show up in the verbose output like the skipped tests did. I think this is a good strategy, and just have a few code suggestions for you to consider below:

  1. Probably add a docstring to pytest_collection_modifyitems to explain what it's doing and link to the GitHub post.
  2. In conftest.py, can we move the marker configuration into pyproject.toml? Defining the config programmatically feels a little odd, outside of a plugin approach where it needs to be patched in after-the-fact.
  3. In test_port.py, can we rename uncollect_if to something specific, like estimator_does_not_support_n_components (or something shorter)? It would be even better if it could be inlined, but I imagine that would be too long to be easily readable?
  4. In the same function above, what do you think about replacing the method not in ORDINATION_ESTIMATOR_RESULTS.keys() check with not hasattr(estimator(), "n_components")? I think that accomplishes the same thing and would allow us to get rid of ORDINATION_ESTIMATOR_RESULTS (rough sketch below), but I don't have a strong feeling, so let me know if there's a downside I'm not considering.
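
Roughly what I'm picturing, as a sketch (the exact name and signature are up to you):

def estimator_does_not_support_n_components(result, n_components):
    """Return True for parametrized cases that should be deselected: a reduced
    n_components requested for an estimator with no n_components hyperparameter."""
    _, estimator = result
    return n_components is not None and not hasattr(estimator(), "n_components")
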
grovduck commented 1 year ago

Arrgh, I'm losing steam on this PR ... A couple of blockers/annoyances:

  1. I tried using the new dictionary union operator (|), but it fails on 3.8. Then I tried building the dict and updating it instead (d = {"a": 1}; d.update({"b": 2})), but now sourcery complains on pre-commit that I should be using the union operator (I sketch a 3.8-compatible alternative below). I think I'm almost ready to ditch sourcery as a pre-commit hook because it doesn't look like you can specify options via pyproject.toml and, because I'm using Python 3.10 in development, it always wants me to incorporate the new features (the same thing goes for the walrus operator in one case).

  2. We can filter out the impossible combinations of (method, n_components), but in the test we also need to change the hyperparameters based on the estimator type. This means we're writing the same check (i.e. hasattr(estimator(), "n_components")) both in the new estimator_does_not_support_n_components function and in the test itself.

  3. In order to use estimator_does_not_support_n_components in both test_kneighbors and test_predict, the function signature has to accommodate result and n_components from test_kneighbors, but also the weighted parameter from test_predict. For now, the function signature looks like def estimator_does_not_support_n_components(result, n_components, **kwargs), but I'm not sure that's the right approach.
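
Here's the 3.8-compatible alternative I mean for item 1 (placeholder values; I haven't actually checked whether sourcery flags this form too):

n_components = 3
base_hyperparams = {"n_neighbors": 5}

# Dict unpacking has worked since Python 3.5, so it runs on 3.8 without the
# 3.9-only | operator and without the two-step dict(...)/update(...) pattern.
hyperparams = {**base_hyperparams, "n_components": n_components}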

I could use your feedback on these issues. The first is a blocker for actually pushing the code and it might be easier to provide feedback on the second two after I commit/push my new changes.

aazuspan commented 1 year ago

I tried using the new dictionary union operator (|), but it fails on 3.8. Then I tried building the dict and updating it instead (d = {"a": 1}; d.update({"b": 2})), but now sourcery complains on pre-commit that I should be using the union operator. I think I'm almost ready to ditch sourcery as a pre-commit hook because it doesn't look like you can specify options via pyproject.toml and, because I'm using Python 3.10 in development, it always wants me to incorporate the new features (the same thing goes for the walrus operator in one case).

Wow, dict unions are new to me! I feel like I always miss out on the cool new features by maintaining backwards compatible packages...

Back to the issue at hand, I'm open to dropping sourcery from our pre-commit. I think I've accepted a few corrections from it that I was ambivalent about, but it's also pretty slow to run and the fact that any new contributor would have to sign up for an account to run our checks is a little off-putting. With the lack of configuration and the Python version issues, I'd say it's doing more harm than good.

As far as how we remove it, I'm totally open to you just disabling it in this PR and then cleaning up the rest of the pre-commit stuff in #41. That's probably not strictly the cleanest way to do it, but I think the alternative of doing #41 first would end up blocking this for a while and probably create a headache when you have to merge in all the changes that will stem from running the pyupgrade rules. Unless it feels too janky to you to combine this PR with a change to the pre-commit, that's the lazy approach I would suggest.

As recommended, I'll take a closer look at the other questions once we get to a resolution here!

grovduck commented 1 year ago

As far as how we remove it, I'm totally open to you just disabling it in this PR and then cleaning up the rest of the pre-commit stuff in #41. That's probably not strictly the cleanest way to do it, but I think the alternative of doing #41 first would end up blocking this for a while and probably create a headache when you have to merge in all the changes that will stem from running the pyupgrade rules. Unless it feels too janky to you to combine this PR with a change to the pre-commit, that's the lazy approach I would suggest.

Excellent, all sourcery code is currently commented out. With #41, I'll leave it to you for the removal (we'll probably want to remove the key as well?). black did some spacing formatting on .pre-commit-config.yaml, so hopefully that jives with how it formats on your end.

To your earlier code suggestions:

  1. Probably add a docstring to pytest_collection_modifyitems to explain what it's doing and link to the Github post.

I've done this and added a link to the specific comment in the pytest issue.

  2. In conftest.py, can we move the marker configuration into pyproject.toml? Defining the config programmatically feels a little odd, outside of a plugin approach where it needs to be patched in after-the-fact.

Cool, I didn't know you could do this here, but I think I've done it correctly.
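
For reference, the pyproject.toml equivalent is something like this (same marker text as the old pytest_configure call):

[tool.pytest.ini_options]
markers = [
    "uncollect_if(*, func): function to unselect tests from parametrization",
]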

  3. In test_port.py, can we rename uncollect_if to something specific, like estimator_does_not_support_n_components (or something shorter)? It would be even better if it could be inlined, but I imagine that would be too long to be easily readable?

I actually liked your naming suggestion for the function, so I kept it. I didn't inline it because we can use it for both tests (I think you meant inlining the function within the test itself, right?).

  4. In the same function above, what do you think about replacing the method not in ORDINATION_ESTIMATOR_RESULTS.keys() check with not hasattr(estimator(), "n_components")? I think that accomplishes the same thing and would allow us to get rid of ORDINATION_ESTIMATOR_RESULTS, but I don't have a strong feeling, so let me know if there's a downside I'm not considering.

I've done this as well and now ORDINATION_ESTIMATOR_RESULTS is no longer needed.

Hopefully, the two questions/concerns I raised earlier will now make sense. Thanks for sticking with me on this one - it's turned out to be a bit epic!