mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License

How to get C-to-U site confidence score in Bullseye for single cell RNAseq data? #14

Open ghost opened 1 year ago

ghost commented 1 year ago

Hi Mathieu, Bullseye is wonderful for single-cell RNAseq C-to-U site detection, and the results include control coverage, control ratio, DART coverage, and DART ratio. How can I get a confidence score for each C-to-U site in Bullseye for single-cell RNAseq data?

In the SAILOR pipeline, the final result gives a confidence score for each detected C-to-U site: https://github.com/YeoLab/sailor. Here is the paper in which they calculate the confidence score: https://www.nature.com/articles/s41592-021-01128-0#data-availability

I think the edit ratio and confidence score together can help us get a better understanding of the data.

Looking forward to your reply. Best, Zongmin

mflamand commented 1 year ago

Hi Zongmin,

If I remember correctly, the Nature Methods paper you linked to used two scores, an epsilon score and the SAILOR score. I am not sure which one you are referring to. I have not implemented their epsilon score in Bullseye, and I would be unsure how to do so with the current pipeline.

As for the SAILOR score, I believe it represents the confidence that a detected site is real, based on coverage and the number of mutations. These parameters are already taken into account in Bullseye. The score can be useful for ranking sites, but you still need to set an arbitrary cutoff. It also does not account for a comparison between a control sample (APOBEC alone or YTHmut-APOBEC) and the DART sample.

There is a --score option in Find_edit_sites.pl that approximates this score. When it is used, it calculates the confidence for each site in both the DART sample and the control sample, then compares the two as log10(confidence in DART / confidence in control). A value above 1 would therefore mean the site is more than 10x more likely to be real in the DART sample than in the control. For single-cell data there may be a large difference in coverage between each cell and the pseudobulk control sample, and I am unsure how this score would turn out.
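To make the comparison concrete, here is a minimal Python sketch of the log10 ratio described above. It is only an illustration, not the code in Find_edit_sites.pl: the function name `log_confidence_ratio` and the `floor` parameter are hypothetical, and the two confidence values are assumed to have been computed elsewhere as numbers between 0 and 1.

```python
import math


def log_confidence_ratio(conf_dart, conf_control, floor=1e-6):
    """Return log10(conf_dart / conf_control).

    Hypothetical helper for illustration only. `floor` is an assumed
    lower bound that keeps the ratio finite when a confidence is 0;
    the actual pipeline may handle that case differently.
    """
    dart = max(conf_dart, floor)
    control = max(conf_control, floor)
    return math.log10(dart / control)


# A site that is 10x more confident in DART than in control scores about 1.
print(log_confidence_ratio(0.5, 0.05))

# Equal confidence in both samples scores 0.
print(log_confidence_ratio(0.5, 0.5))
```

A positive value means the site looks more confident in the DART sample, a negative value means it looks more confident in the control, and any cutoff applied to it is still arbitrary.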

Overall, when we tried to use it, we found that it did not improve the sites selected, and that the other parameters of Bullseye could replicate filtering by score quite well, with the advantage that we know more precisely what we are changing. Of course, in both cases we are setting arbitrary cutoffs, so there may be biases.

Perhaps in the future I will look at the epsilon score.

Please let me know if that answers your question.

Best,