scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License
1.4k stars, 234 forks

Repairs chunk generation for unknown labels, solves issue #260 #263

Closed: RobinVogel closed this pull request 4 years ago

RobinVogel commented 5 years ago

The discussion in pull request #254 was about the generation of chunks in constraints.py not providing a map from the index of each element to its chunk in the presence of unknown labels.

To make this clearer, below is a comparison, in the presence of unknown labels, between the output I expected from Constraints.chunks, i.e. expected_chunk, and the one I actually obtain, i.e. chunks:

In [1]: from metric_learn.constraints import Constraints                                                                                                                                                        

In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]                                                        

In [3]: cons = Constraints(partial_labels)                                                             

In [4]: cons.chunks(num_chunks=3, chunk_size=2)                                                        
Out[4]: array([0, 1, 1, 0, 2, 2])

In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)                                               

In [6]: len(chunks), len(partial_labels)                                                               
Out[6]: (6, 7)

In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not, as I expected, a map such that chunks[i] gives the chunk of element i (like expected_chunk above), though this might not matter for RCA.
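For illustration, here is a minimal sketch of the expected behaviour: one entry per input point, with -1 marking unlabeled points. `make_chunks` is a hypothetical helper, not the library code, and it assigns chunks deterministically rather than at random:

```python
import numpy as np

def make_chunks(partial_labels, num_chunks, chunk_size):
    """Sketch: return one chunk index per input point, -1 for unlabeled ones."""
    partial_labels = np.asarray(partial_labels)
    chunks = -np.ones(len(partial_labels), dtype=int)
    # group the indices of labeled points (label >= 0) by their label
    by_label = {}
    for i in np.where(partial_labels >= 0)[0]:
        by_label.setdefault(partial_labels[i], []).append(i)
    chunk_id = 0
    for members in by_label.values():
        while len(members) >= chunk_size and chunk_id < num_chunks:
            for _ in range(chunk_size):
                chunks[members.pop()] = chunk_id
            chunk_id += 1
    if chunk_id < num_chunks:
        raise ValueError('not enough points to form the requested chunks')
    return chunks

print(make_chunks([1, 2, 2, 1, -1, 3, 3], num_chunks=3, chunk_size=2))
# → [ 0  1  1  0 -1  2  2]
```

On the example above this reproduces expected_chunk: the unlabeled point at index 4 keeps the sentinel value -1 instead of being dropped from the array.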

This pull request solves that, which resolves #260 about RCA crashing when used with unknown labels. Since unknown labels seem to be handled in rca.py through a variable named chunk_mask, I expect rca.py to work with unknown labels now.

I had to store the partial labels in constraints.py because I need to locate the unknown labels (label < 0) in partial_labels. I also cleaned up the leftover unittest code in test_constraints.py.
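The chunk_mask handling mentioned above can be sketched as follows. This is illustrative only (the variable names beyond chunk_mask are assumptions, not the actual rca.py code), assuming chunks now uses -1 for unlabeled points:

```python
import numpy as np

chunks = np.array([0, 1, 1, 0, -1, 2, 2])   # chunk map with the fix applied
X = np.arange(14).reshape(7, 2)             # toy data, one row per point

chunk_mask = chunks != -1                   # drop unlabeled points
X_used, chunks_used = X[chunk_mask], chunks[chunk_mask]
print(chunks_used)                          # → [0 1 1 0 2 2]
```

Because the map now has one entry per input point, masking X and chunks with the same boolean array keeps them aligned index-wise.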

RobinVogel commented 4 years ago

I wrote the two tests: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/constraints.py. I am interested in understanding why you chose these tests.

The code coverage checks did not pass because the method adjacency_matrix of Constraints in constraints.py is not tested, and I added a line to that method, which yields a negative coverage difference.

After a grep on all the .py files, I noticed that the function is not used anywhere in metric-learn. I don't see why it exists. To fix the pull request I can either:

For the moment, I opted for deleting it.

bellet commented 4 years ago

> I wrote the two tests: the one on RCA in tests/metric_learn_test.py in the class TestRCA, and the one on the constraints in tests/constraints.py. I am interested in understanding why you chose these tests.

The test on RCA is a natural one given the change made in this PR: it checks that unlabeled points are ignored. Note also that the test would fail on the previous version (since RCA would crash in the presence of unlabeled points). Generally, when fixing a bug, one should always write a test that would have failed before the fix, to make sure the bug is not reintroduced later without notice.

The test checking that no unlabeled points are assigned to any chunk is related to the fact that the change in this PR keeps entries corresponding to unlabeled points in the chunk array. It therefore makes sense to check that there is no indexing issue, i.e. that no such point is actually assigned to a chunk.
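The two invariants described above can be sketched directly on the arrays from the example session earlier in the thread (a sketch of the checked properties, not the actual test code):

```python
import numpy as np

partial_labels = np.array([1, 2, 2, 1, -1, 3, 3])
chunks = np.array([0, 1, 1, 0, -1, 2, 2])   # output with the fix

# the chunk map has one entry per input point...
assert len(chunks) == len(partial_labels)
# ...and no unlabeled point (label < 0) is assigned to any chunk
assert np.all(chunks[partial_labels < 0] == -1)
```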

RobinVogel commented 4 years ago

It took me some thinking to understand why the covariance matrix in my tests always ends up full of NaNs. This is caused by two points stacked up:

I think both points should be addressed with proper errors. Should I raise an issue, or try and address those in this PR?

bellet commented 4 years ago

> It took me some thinking to understand why the covariance matrix in my tests always ends up full of NaNs. This is caused by two points stacked up:
>
> * The default parameters of scikit-learn's `make_classification` include `n_redundant=2`, meaning two of the features are linear combinations of the informative ones. In that case, the algorithm will fit the data but return a useless NaN matrix. Using only informative features solves this problem; see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) for more details.
>
> * The default value of `chunk_size` in `RCA_Supervised` is 2, and at some point one subtracts the average from the points in each chunk, e.g. if I have `x1` and `x2` they are replaced by `(x1-x2)/2` and `(x2-x1)/2`. It means that with chunks of size 2, the maximum rank of my covariance matrix is `num_chunks`, which might be below the number of features when doing small-scale tests.
>
> I think both points should be addressed with proper errors. Should I raise an issue, or try and address those in this PR?

Good observations.
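The rank bound in the second point can be checked numerically. This is a quick sketch, not library code: each size-2 chunk, once centered, contributes a rank-1 term, so the covariance has rank at most num_chunks:

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, num_chunks, chunk_size = 10, 4, 2

# build num_chunks chunks of size 2 and center each one
centered = []
for _ in range(num_chunks):
    chunk = rng.randn(chunk_size, n_features)
    # the two centered rows are opposites of each other: rank 1 per chunk
    centered.append(chunk - chunk.mean(axis=0))
centered = np.vstack(centered)

cov = centered.T @ centered          # inner covariance, up to scaling
rank = np.linalg.matrix_rank(cov)
print(rank)                          # at most num_chunks (4), below n_features (10)
```

With 4 chunks and 10 features, the matrix is singular and inverting it produces the NaNs observed above.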

Note that we already have a warning which is raised in this case, advising the user to first reduce the dimension of the data:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.
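For the first point, the redundant-feature default can be avoided by making every feature informative. A sketch with illustrative parameter values:

```python
import numpy as np
from sklearn.datasets import make_classification

# defaults include n_redundant=2 (linearly dependent features);
# requesting only informative features avoids a singular covariance
X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, n_repeated=0, random_state=0)

cov = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(cov))   # 5: full rank, hence invertible
```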

Regarding how to improve the errors:

bellet commented 4 years ago

For the first point, we could simply adapt the current warning:

UserWarning: The inner covariance matrix is not invertible, so the transformation matrix may contain NaN values. You should remove any linearly dependent features and/or reduce the dimensionality of your input, for instance using `sklearn.decomposition.PCA` as a preprocessing step.
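The preprocessing that the warning suggests can be sketched with scikit-learn's PCA (toy shapes, chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# reduce dimensionality before fitting RCA, as the warning suggests
X_reduced = PCA(n_components=5).fit_transform(X)
print(X_reduced.shape)   # (100, 5)
```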

RobinVogel commented 4 years ago

I made all the requested modifications and added the remark about the bound on the rank.