HelloWorldLTY closed this issue 3 years ago
Hi @ChineseBest,
The graph connectivity metric is a very simple metric of batch removal. You can find it described in the preprint here. It essentially checks how many cells of the same cell identity are directly connected in a kNN graph after data integration. Methods that integrate new batches incorrectly will often create disconnected clusters of cells that should have been merged. This metric detects those cases.
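To make that concrete, here is a minimal sketch of the idea, not the package's actual implementation. It assumes the score is the mean, over cell identity labels, of the fraction of that label's cells that fall in its largest connected component of the kNN graph. The function name `graph_connectivity_sketch` and the toy graph are hypothetical and only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def graph_connectivity_sketch(adjacency, labels):
    """Mean over labels of (largest connected component size / cells with that label).

    Hypothetical helper for illustration; `adjacency` is a symmetric kNN
    adjacency matrix and `labels` holds one label per cell.
    """
    adjacency = csr_matrix(adjacency)
    labels = np.asarray(labels)
    scores = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        # Keep only edges between cells that share this label.
        sub = adjacency[idx][:, idx]
        _, component_ids = connected_components(sub, directed=False)
        largest = np.bincount(component_ids).max()
        scores.append(largest / idx.size)
    return float(np.mean(scores))


# Toy graph: 6 cells, two cell types. Type A is fully connected, while one
# type B cell (index 5) is attached only to a type A cell.
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_adjacency = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (3, 4), (5, 0)]:
    toy_adjacency[i, j] = toy_adjacency[j, i] = 1

toy_score = graph_connectivity_sketch(toy_adjacency, toy_labels)
print(round(toy_score, 3))  # A contributes 3/3, B contributes 2/3
```

In this toy case the stray type B cell lowers the score below 1, which is the kind of unmerged cluster the metric is meant to flag.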
Thanks for your explanation. I have read your excellent paper, but I would still like to check whether my understanding is correct. That is, if I use batch names as labels, we would expect the graph connectivity to be low, and if I use cell type names as labels, we would expect it to be high. Am I correct? Thanks
Ah, now I understand your question. If you choose cell type names as labels, then the score should be high, yes (this is the intended use). If you choose batch names as labels, on the other hand, I don't think the score makes much sense. With batch names as labels, a very high score means that all cells of a batch are connected to each other (this is generally bad, but it can also happen if the k in your kNN graph is too high). In the best case, you would be evaluating how large the largest connected cell cluster within a single batch is. If the integration is working well, then that's just the largest cell identity cluster. You can also get lower scores if you have many batches that are all well integrated but no biological variation left in your dataset. So a lower score is not necessarily a good thing here.
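As a rough illustration of the difference, here is a toy sketch that scores the same graph once with cell type labels and once with batch labels. It assumes the score is the mean per-label fraction of cells in the largest connected component. The helper `lcc_fraction_score` and the idealized "perfectly integrated" graph are made up for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def lcc_fraction_score(adjacency, labels):
    """Hypothetical helper: mean per-label fraction of cells in the largest component."""
    adjacency = csr_matrix(adjacency)
    labels = np.asarray(labels)
    fractions = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        sub = adjacency[idx][:, idx]  # edges among cells sharing this label
        _, comp = connected_components(sub, directed=False)
        fractions.append(np.bincount(comp).max() / idx.size)
    return float(np.mean(fractions))


# 8 cells: two cell types, each split evenly across two batches.
cell_types = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
batches = np.array(["b1", "b2", "b1", "b2", "b1", "b2", "b1", "b2"])

# Idealized integration: every cell is connected to all cells of its own
# cell type (regardless of batch) and to no cells of the other type.
same_type = cell_types[:, None] == cell_types[None, :]
adjacency = (same_type & ~np.eye(8, dtype=bool)).astype(int)

cell_type_score = lcc_fraction_score(adjacency, cell_types)
batch_score = lcc_fraction_score(adjacency, batches)
print(cell_type_score)  # 1.0: each cell type forms one connected component
print(batch_score)      # 0.5: within a batch, the largest component is one cell type
```

With batch labels, the score here only reflects the largest cell type cluster inside each batch, so it says little about integration quality.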
I would stick with using only cell type labels as "label" input for graph connectivity. Hope that helps!
What is the principle of this metric? I think a better method should have a lower gc for batch labels and a higher gc for cell type labels. Thanks