theislab / scib

Benchmarking analysis of data integration tools
MIT License

Input data without cell annotation #288

Closed XuezhenChen closed 2 years ago

XuezhenChen commented 2 years ago

Dear authors,

Thanks for the great work! The scIB tool has been a great resource for our team. We're trying to evaluate batch removal on several integrated datasets. So far, scib-pipeline/scripts/metrics/metrics.py has worked well. However, metrics.py requires --label_key. Is it reasonable to compute the metrics on data without annotated cell labels? (We would like to evaluate batch effect correction before we move on to manual annotation.)

I've considered the following options:

  1. Does it make sense to use cluster labels as label_key?
  2. Use only the metrics that do not require label_key (e.g. from scIB.metrics import kbet) and run them separately.

I'm new to this area and it would be great if you could give me some advice. Thanks for your time!

LuckyMD commented 2 years ago

Hi @XuezhenChen,

Thanks for the kind words. There are a few metrics you can use without labels. We adapted the kBET metric to use labels, though, so that we can correct for differences in cell type composition. PCR_batch and graph iLISI are the batch removal metrics that don't require labels to run. On the bio conservation side, trajectory conservation, cell cycle conservation and HVG conservation don't require cell type labels.
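To illustrate why a metric like PCR_batch needs no labels, here is a minimal NumPy sketch of the idea behind principal component regression: run PCA, regress each component on the batch covariate, and weight the resulting R² values by the variance each component explains. This is not scib's implementation; the function name `pcr_batch_variance`, its parameters, and the synthetic data are assumptions made for the example. For real analyses, use the package's own metric functions and check their documentation for the exact signatures.

```python
import numpy as np


def pcr_batch_variance(X, batch, n_comps=10):
    """Fraction of the variance in the top PCs that the batch covariate explains.

    Hypothetical helper for illustration only. X is a cells x features matrix,
    batch is a 1-D array with one batch identifier per cell.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)

    # PCA via SVD of the column-centred matrix.
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    n_comps = min(n_comps, S.size)
    pcs = U[:, :n_comps] * S[:n_comps]          # PC scores per cell
    var = S[:n_comps] ** 2 / (X.shape[0] - 1)   # variance of each PC

    # For a categorical covariate, the R^2 of a linear regression equals
    # the between-group sum of squares divided by the total sum of squares.
    r2 = np.zeros(n_comps)
    for j in range(n_comps):
        pc = pcs[:, j]
        total_ss = np.sum((pc - pc.mean()) ** 2)
        if total_ss == 0:
            continue
        between_ss = sum(
            np.sum(batch == b) * (pc[batch == b].mean() - pc.mean()) ** 2
            for b in np.unique(batch)
        )
        r2[j] = between_ss / total_ss

    # Variance-weighted average of R^2 across components.
    return float(np.sum(var * r2) / np.sum(var))


# Synthetic demo: the same cells with and without an additive batch shift.
rng = np.random.default_rng(0)
batch = np.repeat(["a", "b"], 100)
X_mixed = rng.normal(size=(200, 20))
X_shifted = X_mixed.copy()
X_shifted[batch == "b"] += 3.0

print("well mixed:", pcr_batch_variance(X_mixed, batch))
print("batch shift:", pcr_batch_variance(X_shifted, batch))
```

A lower value means less of the dominant variation is attributable to batch. Comparing the value before and after integration gives a label-free view of how much batch signal was removed.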

Regarding replacing label_key with a clustering output: this is possible, but I would definitely verify the clusters. In the end, the resolution of the cluster annotation determines what you evaluate recovery of. If you base your clustering on one integrated embedding/graph, you will bias all your evaluations towards that embedding. A better option might be to cluster each batch separately and then map the clusters to one another by correlation or marker genes.
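The per-batch mapping step could look roughly like the sketch below, which assumes each batch has already been clustered on its own and that both batches share the same feature (gene) order. The helper names `cluster_centroids` and `match_clusters`, the within-batch centring, and the simple best-match rule are all assumptions for illustration, not part of scib. Any mapping obtained this way should still be checked against marker genes.

```python
import numpy as np


def cluster_centroids(X, labels):
    """Mean expression profile per cluster, after centring genes within the batch."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Centre each gene within the batch so a constant batch offset
    # does not dominate the correlations.
    Xc = X - X.mean(axis=0)
    names = np.unique(labels)
    centroids = np.vstack([Xc[labels == k].mean(axis=0) for k in names])
    return names, centroids


def match_clusters(X_a, labels_a, X_b, labels_b):
    """Map every cluster in batch A to its best-correlated cluster in batch B.

    Returns {cluster_in_A: (cluster_in_B, pearson_r)}. Several A clusters may
    map to the same B cluster; no one-to-one constraint is enforced.
    """
    names_a, cent_a = cluster_centroids(X_a, labels_a)
    names_b, cent_b = cluster_centroids(X_b, labels_b)
    n_a = len(names_a)
    # np.corrcoef treats rows as variables, so the off-diagonal block holds
    # the A-versus-B Pearson correlations across genes.
    corr = np.corrcoef(cent_a, cent_b)[:n_a, n_a:]
    best = corr.argmax(axis=1)
    return {
        a: (names_b.tolist()[j], float(corr[i, j]))
        for i, (a, j) in enumerate(zip(names_a.tolist(), best))
    }


# Synthetic demo: three cell types, clustered separately in two batches,
# with different cluster names and an additive offset in batch B.
rng = np.random.default_rng(1)
profiles = rng.normal(scale=3.0, size=(3, 30))
cell_type = np.repeat([0, 1, 2], 40)
X_a = profiles[cell_type] + rng.normal(size=(120, 30))
X_b = profiles[cell_type] + 2.0 + rng.normal(size=(120, 30))
labels_a = np.array(["c0", "c1", "c2"])[cell_type]
labels_b = np.array(["z", "x", "y"])[cell_type]

for a, (b, r) in match_clusters(X_a, labels_a, X_b, labels_b).items():
    print(f"{a} -> {b} (r = {r:.2f})")
```

Because the clusters are defined per batch rather than on one integrated embedding, the resulting harmonised labels do not favour any particular integration method when passed as label_key.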

Good luck!

XuezhenChen commented 2 years ago

Hello @LuckyMD,

Thanks for the clarification on kBET. We'll proceed with the metrics that don't require cell labels to run as you mentioned. Thanks again for your help!