mckennalab / FlashFry

FlashFry: The rapid CRISPR target site characterization tool

guideRNA ranking #29

Closed Huanle closed 2 years ago

Huanle commented 2 years ago

Hi @aaronmck ,

I have a question regarding the ranking of guideRNA sequences. In the results I saw: AggregateRankedScore_medianRank, AggregateRankedScore_tranche, and AggregateRankedScore_topX. I roughly understand from the wiki introduction that these are aggregated ranking metrics. But how are they aggregated? What is AggregateRankedScore_tranche? Sometimes AggregateRankedScore_medianRank and AggregateRankedScore_topX are the same, sometimes similar, and sometimes very different. What is the underlying cause?

Also, I found the guide-target free energy (--folding) computation does not change the results. Is this expected?

Thanks a lot. I like FlashFry since it incorporates lots of scoring metrics and is very fast!

aaronmck commented 2 years ago

Hi @Huanle,

The aggregate score method is based on the Schulze method, which is a ranked preference voting scheme. The idea is that it can be really hard to combine different scoring methods with different distributions into a single 'score'. Instead, we consider the rank of each score and find the highest ranked (most preferred) targets.
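To make the ranked-voting idea concrete, here is a minimal Python sketch of the Schulze method in which each scoring metric acts as a "voter" over candidate guides. This is not FlashFry's code: the function name `schulze_order`, the metric names, and the scores are all made up for illustration, and it assumes that a higher score is better for every metric.

```python
from itertools import permutations

def schulze_order(scores):
    """scores: {metric: {candidate: score}}; higher is assumed to be better.
    Returns candidates ordered from most to least preferred."""
    candidates = sorted(next(iter(scores.values())))

    # d[a][b] = number of metrics that score a strictly above b
    d = {a: {b: 0 for b in candidates} for a in candidates}
    for per_metric in scores.values():
        for a, b in permutations(candidates, 2):
            if per_metric[a] > per_metric[b]:
                d[a][b] += 1

    # p[a][b] = strength of the strongest path from a to b
    p = {a: {b: (d[a][b] if d[a][b] > d[b][a] else 0) for b in candidates}
         for a in candidates}
    for k in candidates:
        for i in candidates:
            if i == k:
                continue
            for j in candidates:
                if j in (i, k):
                    continue
                p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))

    # Order candidates by how many others they beat on strongest paths
    wins = {a: sum(p[a][b] > p[b][a] for b in candidates if b != a)
            for a in candidates}
    return sorted(candidates, key=lambda c: (-wins[c], c))

# Hypothetical scores on three very different scales
scores = {
    "metric_a": {"g1": 0.9, "g2": 0.5, "g3": 0.1},
    "metric_b": {"g1": 80, "g2": 90, "g3": 10},
    "metric_c": {"g1": 3, "g2": 2, "g3": 1},
}
order = schulze_order(scores)
print(order)
```

Only the relative order within each metric matters here, which is why metrics with very different distributions can be combined without rescaling.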

The tranche groups these targets together into larger groups (best 1/4 of the data, next 1/4, etc). The median rank is simply the median of the ranks. The top X gives you the top 1000 targets in order (often in large data sets this is all you want). Sorry for the diverse scores here; it was intended for really large data sets where you generally just want the best scoring targets across large parts of the genome.
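As a rough illustration of how these three columns relate, the sketch below derives a median rank, a quartile tranche, and a top-X position from per-metric ranks. It is a toy under stated assumptions, not FlashFry's implementation: the function name `summarize_ranks`, the input layout, the use of median rank as the overall ordering (FlashFry orders by the Schulze result), and the exact tranche boundaries are all hypothetical.

```python
from statistics import median

def summarize_ranks(ranks, top_x=1000):
    """ranks: {guide: [rank under metric 1, rank under metric 2, ...]}, 1 = best.
    Returns {guide: {"median_rank", "tranche", "top_x"}}."""
    med = {g: median(r) for g, r in ranks.items()}
    # Assumption: order guides by median rank, breaking ties by name
    ordered = sorted(ranks, key=lambda g: (med[g], g))
    n = len(ordered)
    summary = {}
    for pos, g in enumerate(ordered):
        summary[g] = {
            "median_rank": med[g],
            # Quartile of the overall ordering: 1 = best quarter, 4 = worst
            "tranche": pos * 4 // n + 1,
            # Position in the overall order, reported only for the first top_x
            "top_x": pos + 1 if pos < top_x else None,
        }
    return summary

# Hypothetical ranks of four guides under three metrics
ranks = {
    "g1": [1, 2, 1],
    "g2": [2, 1, 2],
    "g3": [3, 3, 4],
    "g4": [4, 4, 3],
}
summary = summarize_ranks(ranks, top_x=2)
for guide, row in summary.items():
    print(guide, row)
```

In a sketch like this, the median rank summarizes one guide's own per-metric ranks, while the top-X value is its position relative to every other guide, so the two can agree when the metrics agree and drift apart when they do not.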

Some scoring methods don't implement the RankedScore trait, as we don't know what the 'best' target is. Free energy is one example: some people may want specific values for this, and it wasn't clear what a good score was, so it doesn't feed into the aggregate ranking.

Good luck! -Aaron

Huanle commented 2 years ago

Thanks a lot @aaronmck .