scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Label Spreading: knn Kernel Variant differs from Reference Paper #11784

Open markusdosch opened 6 years ago

markusdosch commented 6 years ago

According to label_propagation.py#L487, the reference paper for the Label Spreading implementation is Zhou et al.: Learning with local and global consistency (2004).

But there are two things not in line with the paper when using the knn kernel variant for Label Spreading:

(A) Self-loops in affinity matrix

According to the paper, the adjacency matrix should have no self-loops (W_ii = 0). But when looking at the relevant line in _get_kernel (L134, called with y=None; see L516 for Label Spreading and L397 for Label Propagation), kneighbors_graph gets called with mode='connectivity', which results in include_self=True according to the docs. Therefore, the more correct call would use include_self=False.
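
For illustration, here is a minimal sketch (toy data, not from the issue) of how include_self changes the diagonal of the affinity matrix built by kneighbors_graph:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(5, 2)  # toy data: 5 points in 2D

# Current behaviour: each point counts as its own nearest neighbour,
# so the affinity matrix carries self-loops (W_ii = 1).
W_self = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                          include_self=True)
print(W_self.diagonal())    # [1. 1. 1. 1. 1.]

# Paper-conformant variant per Zhou et al. (2004): no self-loops, W_ii = 0.
W_noself = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                            include_self=False)
print(W_noself.diagonal())  # [0. 0. 0. 0. 0.]
```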

Note that this change would not only modify the kneighbors_graph affinity matrix used for Label Spreading, but also the one for Label Propagation. Looking at the reference paper for Label Propagation, Zhu et al.: Learning from labeled and unlabeled data with label propagation (2002), they do not seem to require the absence of self-loops. But in sklearn's other reference, Bengio et al.: In Semi-Supervised Learning (2006), mentioned here, they advocate forcing W_ii = 0 anyway, so I would use include_self=False for Label Propagation's kneighbors_graph as well.

(B) Laplacian Matrix Calculation

In label_propagation#L517, we see that sparse.csgraph.laplacian gets called. The default value for this method's use_out_degree is False. That is not a problem if the affinity matrix is symmetric, but since knn graphs are usually not symmetric, use_out_degree=True should be used. The issue is easier to see when getting the diagonal via sparse.csgraph.laplacian(..., normed=False, return_diag=True): with use_out_degree=False, the diagonal values won't match the k from kneighbors_graph.
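
As a quick demonstration (again on toy data), the diagonal returned by scipy's laplacian only equals k when use_out_degree=True:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(6, 2)  # toy data
W = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                     include_self=False)

# kNN graphs are directed in general: j being among i's neighbours does
# not imply the converse, so W is typically not symmetric.
print((W != W.T).nnz > 0)  # usually True

# use_out_degree=False sums columns (in-degrees), which vary per node;
# use_out_degree=True sums rows (out-degrees), which are exactly k here.
_, d_in = laplacian(W, normed=False, return_diag=True)
_, d_out = laplacian(W, normed=False, return_diag=True,
                     use_out_degree=True)
print(d_in)   # in-degrees, generally != k
print(d_out)  # [2. 2. 2. 2. 2. 2.]
```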

Summary

To conform with the reference paper, we should:

- call kneighbors_graph with include_self=False, and
- call sparse.csgraph.laplacian with use_out_degree=True.

If you agree with me, I can submit this small PR and fix the tests in case any break. 😊

jnothman commented 6 years ago

These sound reasonable... I think we need to work out whether they justify a backwards-incompatible change, or whether we should use some parameter setting or similar to phase it in.

Ping @musically-ut

musically-ut commented 6 years ago

This would be a step in the right direction, but some of the core problems with the kNN kernel will still remain (see #8008). Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to sparsify them is something I have not seen actively recommended in many places, with a notable exception:

Wang, Fei, and Changshui Zhang. "Label propagation through linear neighborhoods." IEEE Transactions on Knowledge and Data Engineering 20.1 (2008): 55-67.

I've been meaning to give the kNN kernels an overhaul for a while now, but each step counts. 👍

> whether they justify a backwards-incompatible change

I think as long as the examples and the test cases do not break, these changes can be considered bug fixes rather than feature additions. So, personally, I do not think that they will need a parameter for flow control.

markusdosch commented 6 years ago

> Also, the reference papers start with connectivity matrices and do not describe how to construct them per se; using kNNs to sparsify them is something I have not seen actively recommended in many places, with a notable exception

Just to add to that: Bengio et al.: In Semi-Supervised Learning (2006), one of the current reference papers, explicitly mentions a k-nearest-neighbor matrix a couple of times. However, in one of the next sentences in the introduction, they assume W_ij to be given by a symmetric positive function. If I understand that correctly, it means they assume W to be symmetric, something that kNN cannot directly deliver. It would have been great if they had elaborated on that a bit more 😅
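
As an aside, a common workaround (not from the paper, just a sketch) is to symmetrize the directed kNN graph before using it, e.g. by connecting i and j whenever either one lists the other as a neighbour:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(6, 2)  # toy data
W = kneighbors_graph(X, n_neighbors=2, mode='connectivity',
                     include_self=False)

# Elementwise maximum of W and its transpose implements a logical OR,
# keeping entries in {0, 1} and making the graph undirected.
W_sym = W.maximum(W.T)
print((W_sym != W_sym.T).nnz == 0)  # True: W_sym is symmetric
```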

hkoppen commented 5 years ago

I suggest exposing mode : {'connectivity', 'distance'} as an additional parameter for kNN in order to get the unweighted or weighted kNN graph. Further, I suggest exposing self_loops as an additional parameter.

Maybe one could also explain in the documentation that n_neighbors=1 for kNN is useless, since it will (at the moment) create a kNN graph whose only edges are self-loops at every vertex. So basically, n_neighbors=k gives what I would call a (currently directed) "(k-1)-nearest neighbour" graph :D
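
A minimal sketch of that degenerate case (mirroring how the estimator queries its own training points):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(4, 2)  # toy data
nn = NearestNeighbors(n_neighbors=1).fit(X)

# Querying the training points themselves: each point's single nearest
# neighbour is itself (distance 0), so the graph is just self-loops.
W = nn.kneighbors_graph(X, mode='connectivity')
print(W.toarray())  # identity matrix
```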

jnothman commented 5 years ago

Improvements to the documentation in the first instance are very welcome.