Closed RobinVogel closed 4 years ago
I wrote the one on RCA in `tests/metric_learn_test.py`, in the class `TestRCA`, and the one on the constraints in `tests/constraints.py`. I am interested in understanding why you chose these tests.
The code-coverage checks did not pass because the method `adjacency_matrix` in the `Constraints` class of `constraints.py` is not tested, and I added a line to that function, which yields a negative coverage difference. After a grep on all the `.py` files, I noticed that this function is not used anywhere in metric-learn. I don't see why it exists. To fix the pull request, I can either delete it or add a test for it. For the moment, I opted for deleting it.
The test on RCA is a natural one given the change made in this PR: it checks that unlabeled points are ignored. Note also that the test would fail in the previous version (since RCA would crash in the presence of unlabeled points). Generally, when fixing a bug, one should always write a test that would have failed before the fix, to make sure the bug is not reintroduced later without notice.
The test to check that no unlabeled points are assigned to any chunks is related to the fact that the change in this PR keeps entries corresponding to unlabeled points in the chunk array. Therefore it makes sense to check that there is no indexing issue, i.e. no such point is actually assigned to a chunk.
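The invariant that this second test checks can be sketched in a few lines. This is an illustration with hand-built arrays, not the actual test: the convention that `-1` marks a point assigned to no chunk is an assumption made here for the sketch.

```python
import numpy as np

# Hypothetical output of chunk generation: one chunk index per point,
# with -1 meaning "not assigned to any chunk", and labels < 0 meaning
# "unknown label".
partial_labels = np.array([0, 0, 1, -1, 1, -1, 0, 1])
chunks = np.array([0, 0, 1, -1, 1, -1, 2, 2])

# No point with an unknown label may be assigned to a chunk.
unlabeled = partial_labels < 0
assert np.all(chunks[unlabeled] == -1)
print("no unlabeled point assigned to a chunk")
```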
I spent some time thinking to understand why I always get a covariance matrix full of `NaN`s in my tests. This is caused by two compounding issues:

* The default parameters of scikit-learn's `make_classification` include `n_redundant=2`, which means that two of the features are linear combinations of the others. In that case, the algorithm will fit the data but return a useless `NaN` matrix. Using only informative features solves this problem; see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) for more details.
* The default value of `chunk_size` in `RCA_Supervised` is 2, and at some point one subtracts the chunk mean from the points in each chunk, e.g. if a chunk contains `x1` and `x2`, they are replaced by `(x1-x2)/2` and `(x2-x1)/2`. It means that with chunks of size 2, the maximum rank of the covariance matrix is `num_chunks`, which might be below the number of features when doing small-scale tests.
I think both points should be addressed with proper errors. Should I raise an issue, or try to address them in this PR?
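The rank argument in the second point above can be checked numerically. A minimal sketch with plain NumPy (the chunk-mean centering mimics what RCA does internally; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
num_chunks, chunk_size, d = 5, 2, 8  # fewer chunks than features
X = rng.normal(size=(num_chunks * chunk_size, d))

# Center each chunk on its own mean, as RCA does internally.
chunks = np.repeat(np.arange(num_chunks), chunk_size)
X_centered = np.empty_like(X)
for c in range(num_chunks):
    mask = chunks == c
    X_centered[mask] = X[mask] - X[mask].mean(axis=0)

# Each size-2 chunk contributes a single independent direction,
# so the rank of the centered data (and hence of the inner covariance)
# is at most num_chunks, here below the number of features d.
rank = np.linalg.matrix_rank(X_centered)
print(rank)  # 5, while d = 8: the covariance matrix is singular
```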
Good observations.
Note that we already have a warning which is raised in this case, advising the user to first reduce the dimension of the data:
```
UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain Nan values. You should reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.
```
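The preprocessing this warning suggests can be sketched with plain NumPy, using an SVD projection in place of `sklearn.decomposition.PCA` (which would do the same job); the data and variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Append 2 columns that are linear combinations of the first 4.
X = np.hstack([X, X @ rng.normal(size=(4, 2))])

# The covariance is singular because of the redundant columns.
assert np.linalg.matrix_rank(np.cov(X, rowvar=False)) < X.shape[1]

# PCA-style reduction via SVD keeps only directions with real variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = int(np.sum(s > s[0] * 1e-8))  # number of non-degenerate directions
X_reduced = Xc @ Vt[:k].T         # project onto those directions

# After the reduction, the covariance is full rank again.
assert np.linalg.matrix_rank(np.cov(X_reduced, rowvar=False)) == k
print(X_reduced.shape)  # (50, 4)
```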
Regarding how to improve the errors: for the second point, it would be useful to check whether the issue with `chunk_size=2` generalizes to larger chunk sizes, in the sense that we can have a general bound on the rank of the matrix as a function of `chunk_size` and `num_chunks`, i.e. when this is the case we know for sure it will fail based only on the input parameters. If this is indeed the case, then I agree we should throw a specific error and explain the bound to the user so she can adjust the parameters.

For the first point, we could simply adapt the current error:
```
UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain Nan values. You should remove any linearly dependent features and/or reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.
```
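If the generalization holds, the natural candidate bound is `rank <= num_chunks * (chunk_size - 1)`, since centering removes one degree of freedom per chunk. A quick numerical sanity check of that conjecture (a sketch, not part of the PR):

```python
import numpy as np

rng = np.random.default_rng(42)
num_chunks, chunk_size, d = 3, 3, 10
X = rng.normal(size=(num_chunks * chunk_size, d))
chunks = np.repeat(np.arange(num_chunks), chunk_size)

# Center each chunk on its own mean, as RCA does internally.
X_centered = np.empty_like(X)
for c in range(num_chunks):
    mask = chunks == c
    X_centered[mask] = X[mask] - X[mask].mean(axis=0)

bound = num_chunks * (chunk_size - 1)  # one dimension lost per chunk
rank = np.linalg.matrix_rank(X_centered)
assert rank <= bound
print(rank, bound)  # with generic random data the bound is attained: 6 6
```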
I made all the expected modifications and added a remark with the bound on the rank.
The discussions in pull request #254 were about the generation of chunks in `constraints.py` not providing a map from the index of each element to its chunk in the presence of unknown labels. See the example below:

This pull request solves that, which resolves #260 about RCA crashing when used with unknown labels. Since unknown labels seem to be handled in `rca.py` with a variable named `chunk_mask`, I expect `rca.py` to work with unknown labels now.

I had to save the partial labels in `constraints.py` because I need to locate the unknown labels (label < 0) in `partial_labels`. I also cleaned up the remains of `unittest` in `test_constraints.py`.
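The filtering that `chunk_mask` performs can be sketched as follows. This is an illustration with hand-built arrays, assuming (as above) that `-1` marks points assigned to no chunk; the variable names mirror the description, not the exact code in `rca.py`:

```python
import numpy as np

# Hypothetical chunk assignment: -1 marks points not in any chunk
# (e.g. points with unknown labels).
X = np.arange(12, dtype=float).reshape(6, 2)
chunks = np.array([0, -1, 0, 1, -1, 1])

chunk_mask = chunks != -1   # keep only points that belong to a chunk
X_kept = X[chunk_mask]
chunks_kept = chunks[chunk_mask]

# The kept data and chunk labels stay aligned after filtering.
assert len(X_kept) == len(chunks_kept) == 4
print(chunks_kept)  # [0 0 1 1]
```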