theislab / scib

Benchmarking analysis of data integration tools
MIT License

Questions about graph connectivity metric #246

Closed HelloWorldLTY closed 3 years ago

HelloWorldLTY commented 3 years ago

What is the principle behind this metric? I think a better method should have lower graph connectivity (gc) when batch is used as the label and higher gc when cell type is used as the label. Thanks

LuckyMD commented 3 years ago

Hi @ChineseBest,

The graph connectivity metric is a very simple metric of batch removal. You can find it described in the preprint here. It essentially just checks how well cells of the same cell identity are connected to one another in a kNN graph after data integration. Methods that incorrectly integrate new batches will often create unconnected clusters of cells that should be merged. This metric evaluates those cases.
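To make the idea concrete, here is a minimal sketch in plain Python. It assumes the definition as I understand it from the paper: for each cell identity label, take the subgraph of the kNN graph containing only cells with that label, compute the fraction of those cells that sit in its largest connected component, and average over labels. This is not the scib implementation; the function names, the edge-list input, and the toy graph are all made up for illustration.

```python
from collections import defaultdict


def largest_component_fraction(nodes, edges):
    """Fraction of `nodes` in the largest connected component of the
    subgraph induced by `nodes` (hypothetical helper, union-find based)."""
    nodes = set(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        if a in nodes and b in nodes:  # keep only edges inside the subset
            parent[find(a)] = find(b)

    sizes = defaultdict(int)
    for n in nodes:
        sizes[find(n)] += 1
    return max(sizes.values()) / len(nodes)


def graph_connectivity(edges, labels):
    """Mean over labels of the largest-component fraction.

    edges:  iterable of (cell, cell) pairs from an undirected kNN graph
    labels: dict mapping cell -> label (intended use: cell identity)
    """
    groups = defaultdict(list)
    for cell, label in labels.items():
        groups[label].append(cell)
    scores = [largest_component_fraction(cells, edges) for cells in groups.values()]
    return sum(scores) / len(scores)


# Toy data (invented): four "T" cells and two "B" cells.
cell_type = {0: "T", 1: "T", 2: "T", 3: "T", 4: "B", 5: "B"}

# Poor integration: the T cells fall into two disconnected clusters.
split_edges = [(0, 1), (2, 3), (4, 5)]
score_split = graph_connectivity(split_edges, cell_type)

# Better integration: one extra edge joins the two T clusters.
merged_edges = split_edges + [(1, 2)]
score_merged = graph_connectivity(merged_edges, cell_type)

print(score_split, score_merged)
```

In the split case only half of the T cells are in the largest T component, which pulls the average down; once the two T clusters are joined, every label is fully connected and the score reaches its maximum of 1.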

HelloWorldLTY commented 3 years ago

Thanks for your explanation. I have read your excellent paper, but I would still like to check whether my understanding is correct. That is, if I use batch names as labels, we would expect the graph connectivity to be low. If I use cell type names as labels, we would expect the graph connectivity to be high. Am I correct? Thanks

LuckyMD commented 3 years ago

Ah, now I understand your question. So if you choose cell type name as labels, then it should be high, yes (this is the intended use). On the other hand, if you choose batch name as labels, then I don't think the score makes any sense. With batch name as labels, if you have a very high score, then all cells of the batch are connected (this is generally bad, but it can also happen if the k in your kNN graph is too high). In the best case, you would be evaluating what the largest connected cell cluster within a single batch is. If the integration is working well, then that's just the largest cell identity cluster. You can get lower scores if you have many batches, your batches are all integrated, but you have no more biological variation in your dataset. So a lower score is not necessarily a good thing here.

I would stick with using only cell type labels as "label" input for graph connectivity. Hope that helps!
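
To illustrate the point about batch labels, here is a small self-contained sketch (again not the scib code, just the largest-connected-component idea written with a breadth-first search; all names and the toy graph are invented). The toy data is "well integrated": two cell types form two separate clusters, and both batches are mixed within each cluster.

```python
from collections import defaultdict, deque


def graph_connectivity(edges, labels):
    """Mean, over labels, of the share of a label's cells that lie in the
    largest connected component of that label's subgraph (illustrative only)."""
    groups = defaultdict(set)
    for cell, label in labels.items():
        groups[label].add(cell)

    scores = []
    for cells in groups.values():
        # Adjacency restricted to edges with both endpoints in this group.
        neighbours = defaultdict(set)
        for a, b in edges:
            if a in cells and b in cells:
                neighbours[a].add(b)
                neighbours[b].add(a)

        seen, largest = set(), 0
        for start in cells:
            if start in seen:
                continue
            seen.add(start)
            queue, size = deque([start]), 0
            while queue:  # breadth-first search over one component
                node = queue.popleft()
                size += 1
                for nxt in neighbours[node]:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            largest = max(largest, size)
        scores.append(largest / len(cells))
    return sum(scores) / len(scores)


# Toy data (invented): six T cells (0-5) and two B cells (6-7),
# split evenly across batches "A" and "B".
cell_type = {0: "T", 1: "T", 2: "T", 3: "T", 4: "T", 5: "T", 6: "B", 7: "B"}
batch = {0: "A", 1: "A", 2: "A", 6: "A", 3: "B", 4: "B", 5: "B", 7: "B"}

# T cells form one connected chain across both batches; B cells are linked.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (6, 7)]

by_cell_type = graph_connectivity(edges, cell_type)
by_batch = graph_connectivity(edges, batch)

print(by_cell_type, by_batch)
```

With cell type labels the score is at its maximum because each cell type is fully connected. With batch labels the score only reflects the share of the largest cell identity cluster within each batch (3 of 4 cells here), so it says nothing about how well the batches were mixed.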