markusdosch opened this issue 6 years ago
These sound reasonable... I think we need to work out whether they justify a backwards-incompatible change, or whether we should use some parameter setting or similar to phase it in.
Ping @musically-ut
This would be a step in the right direction, but some of the core problems with the kNN kernel will still remain (see #8008). Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to sparsify them is something I have not seen actively recommended in many places, with one notable exception:
Wang, Fei, and Changshui Zhang. "Label propagation through linear neighborhoods." IEEE Transactions on Knowledge and Data Engineering 20.1 (2008): 55-67.
I've been meaning to give the kNN kernels an overhaul for a while now, but each step counts. 👍
> whether they justify a backwards-incompatible change
I think as long as the examples and the test cases do not break, these changes can be considered bug fixes rather than feature additions. So, personally, I do not think that they will need a parameter for flow control.
> Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to sparsify them is something I have not seen actively recommended in many places, with one notable exception
Just to add on that - Bengio et al., in Semi-Supervised Learning (2006), one of the current reference papers, explicitly mention a k-nearest-neighbor matrix a couple of times. However, in one of the next sentences of the introduction, they assume W_ij to be given by a symmetric positive function. If I understand that correctly, this means they assume W to be symmetric - something that kNN cannot directly deliver. It would have been great if they had elaborated on that a bit more 😅
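One common workaround (not prescribed by the paper; this is just an illustrative sketch) is to symmetrize the directed kNN connectivity matrix after building it, e.g. by keeping an edge whenever it appears in either direction:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Unevenly spaced points: the directed 1-NN graph has 7 -> 3 but not 3 -> 7,
# so the raw connectivity matrix A is not symmetric.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity', include_self=False)

# Elementwise maximum of A and A.T keeps an edge if either endpoint chose the
# other, yielding the symmetric W that kNN alone cannot guarantee.
W = A.maximum(A.T)

print((A != A.T).nnz)  # > 0: the raw graph is asymmetric
print((W != W.T).nnz)  # 0: symmetrized
```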
I suggest adding `mode : {'connectivity', 'distance'}` as an additional parameter for the kNN kernel in order to get the (un)weighted kNN graph. Further, I suggest adding `self_loops` as an additional parameter.
Maybe one could also explain in the documentation that `n_neighbors=1` is useless for kNN, since it will (at the moment) create a kNN graph whose only edges are self-loops for every vertex. So basically `n_neighbors=k` yields what I would call a (currently directed) "(k-1)-nearest-neighbour" graph :D
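The degenerate case is easy to demonstrate; with self-inclusion, each point's single "nearest neighbor" is itself, so the `n_neighbors=1` graph collapses to the identity matrix:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.5]])

# With include_self=True, the 1-NN graph contains only self-loops:
# each point is its own nearest neighbor, so A is the identity matrix.
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity', include_self=True)
print(A.toarray())
```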
Improvements to the documentation in the first instance are very welcome.
According to label_propagation.py#L487, the reference paper for the Label Spreading implementation is Zhou et al.: Learning with local and global consistency (2004).
But there are two things not in line with the paper when using the kNN kernel variant for Label Spreading:
(A) Self-loops in affinity matrix
According to the paper, the adjacency matrix should have no self-loops (W_ii = 0). But looking at the relevant line in `_get_kernel`, L134 (called with `y=None`, see L516 for Label Spreading and L397 for Label Propagation), `kneighbors_graph` gets called with `mode='connectivity'`, which results in `include_self=True`, according to the docs. Therefore, a more correct way would be calling it with `include_self=False`.
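The effect of the flag on the diagonal can be checked directly (parameter values here are illustrative, not taken from the sklearn code):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.0], [5.0]])

A_self = kneighbors_graph(X, n_neighbors=2, mode='connectivity', include_self=True)
A_noself = kneighbors_graph(X, n_neighbors=2, mode='connectivity', include_self=False)

# include_self=True puts 1s on the diagonal (self-loops, W_ii = 1);
# include_self=False zeroes the diagonal, matching W_ii = 0 from the paper.
print(A_self.toarray().diagonal())    # all ones
print(A_noself.toarray().diagonal())  # all zeros
```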
for Label Propagation's kneighbors_graph as well.(B) Laplacian Matrix Calculation
In label_propagation.py#L517, we see that `sparse.csgraph.laplacian` gets called. The default value for this method's `use_out_degree` parameter is `False`. This is not a problem if the affinity matrix is symmetric, but since kNN graphs usually are not symmetric, `use_out_degree=True` should be used. One can see this issue more clearly by getting the diagonal via `sparse.csgraph.laplacian(..., normed=False, return_diag=True)`: when `use_out_degree=False`, the diagonal values won't match the `k` from `kneighbors_graph`.
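A small sketch of the mismatch (the data points are made up to force an asymmetric graph): with unevenly spaced points, the in-degrees vary while every vertex's out-degree is exactly `k`:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

# Directed 1-NN graph on unevenly spaced points:
# 0 -> 1, 1 -> 0, 3 -> 1, 7 -> 3, so in-degrees vary (vertex 7 has none,
# vertex 1 has two) while every out-degree is exactly k = 1.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
A = kneighbors_graph(X, n_neighbors=1, mode='connectivity', include_self=False)

_, d_in = laplacian(A, normed=False, return_diag=True, use_out_degree=False)
_, d_out = laplacian(A, normed=False, return_diag=True, use_out_degree=True)

print(d_in)   # in-degrees [1. 2. 1. 0.] -- not k for every vertex
print(d_out)  # out-degrees [1. 1. 1. 1.] -- equals k everywhere
```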
Summary
To conform with the reference paper, we should call

- `kneighbors_graph` with `include_self=False`
- `sparse.csgraph.laplacian` with `use_out_degree=True`
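Putting both changes together, the proposed construction would look roughly like this (a minimal sketch; the data and `k` are illustrative, not the sklearn internals):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(20, 2)
k = 3  # illustrative neighborhood size

# (A) no self-loops in the affinity matrix, per Zhou et al. (2004)
W = kneighbors_graph(X, n_neighbors=k, mode='connectivity', include_self=False)

# (B) degree taken from out-edges, so every vertex has degree exactly k
L, deg = laplacian(W, normed=False, return_diag=True, use_out_degree=True)

assert np.all(W.diagonal() == 0)  # W_ii = 0 holds
assert np.all(deg == k)           # every vertex has degree k
```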
If you agree with me, I can submit this small PR and fix the tests, in case one breaks. 😊