mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License

Beta score question #9

Closed ckuenne closed 1 year ago

ckuenne commented 1 year ago

Can you give a recommendation considering the beta score parameter of Find_edit_site.pl --score? It's disabled by default and also not used in any examples or the manuscript.

According to the help: "Filterering based on a score based on the probability of edinting based on the beta distribution of nucleotides in the edited matrix over the control matrix is also available with the --score option." and "score and filter reads based on probability of edit based on beta distribution curve. Sites with 10 fold higher probability in DART/control will be kept"

Is this deprecated?

mflamand commented 1 year ago

Hi,

I would say it is mostly deprecated. I initially implemented it based on the SAILOR pipeline used to map A-to-I editing sites (https://github.com/yeolab/sailor). I hoped that it could be a good metric to score sites.

There is, however, a difference between their implementation and mine. In both cases we use a beta distribution to determine the confidence that the observed editing is higher than a set threshold (here it defaults to 10%, but it changes with the parameters you set). However, in Bullseye I also do the same on the control data (which was not part of the ADAR editing site study) and calculate the confidence that the control data is also significantly edited. To compare the DART and control datasets, we can then do:

log10(confidence DART / confidence Control)

to get a score that represents the likelihood of DART over control on a log scale: a score of 1 indicates 10x more likely, and so on. This can then be used to filter sites.
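To make the calculation above concrete, here is a minimal Python sketch of one way such a score could be computed. It is not the code in Find_edit_site.pl or SAILOR. It assumes a Beta(edited + 1, unedited + 1) posterior (a uniform prior, which is my assumption), and the function names `confidence_above` and `beta_score` are hypothetical. For integer read counts the beta tail probability equals a binomial sum, so only the standard library is needed.

```python
from math import comb, log10

def confidence_above(edited: int, total: int, threshold: float = 0.10) -> float:
    """P(true edit rate > threshold), assuming a Beta(edited + 1, total - edited + 1)
    posterior (uniform prior; an assumption, not necessarily Bullseye's choice).

    For integer parameters a, b the beta survival function at t equals
    P(Binomial(a + b - 1, t) <= a - 1), which is evaluated exactly here.
    """
    n = total + 1  # a + b - 1 with a = edited + 1, b = total - edited + 1
    return sum(
        comb(n, j) * threshold**j * (1 - threshold) ** (n - j)
        for j in range(edited + 1)
    )

def beta_score(dart_edited: int, dart_total: int,
               ctrl_edited: int, ctrl_total: int,
               threshold: float = 0.10) -> float:
    """log10(confidence DART / confidence Control), as described in the text."""
    floor = 1e-300  # guard against underflow to zero at very high coverage
    dart_conf = max(confidence_above(dart_edited, dart_total, threshold), floor)
    ctrl_conf = max(confidence_above(ctrl_edited, ctrl_total, threshold), floor)
    return log10(dart_conf / ctrl_conf)

if __name__ == "__main__":
    # Hypothetical site: 8 of 40 reads edited in DART, 1 of 50 in the control.
    print(beta_score(8, 40, 1, 50))
```

Identical counts in both samples give a score of 0, and the score grows as the DART sample becomes more convincingly edited than the control.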

However, in practice we found that other filters (minimum coverage, number of mutations, ratio over controls) are more intuitive, more stringent and easier to adjust. This score could perhaps be useful for mapping sites in low-coverage regions. If you want to experiment, you can add "--score 0.01", which calculates the score but will omit filtering based on it. To filter based on the score, you could use "--score 1" or "--score 2" to keep only sites 10 or 100 times more likely in the DART sample than in the control.
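As a rough illustration of what a given threshold means, the sketch below converts a log10 score cutoff into a fold difference and applies it to a few made-up sites. It assumes the cutoff is compared directly against the log10 score and that sites at or above the cutoff are kept; both are assumptions rather than confirmed behavior of Find_edit_site.pl, and the site names and scores are invented for the example.

```python
def fold_cutoff(score_threshold: float) -> float:
    """Fold difference (DART over control) implied by a log10 score cutoff."""
    return 10 ** score_threshold

def filter_sites(scores: dict[str, float], score_threshold: float) -> dict[str, float]:
    """Keep sites whose score is at or above the cutoff (">=" is an assumption)."""
    return {site: s for site, s in scores.items() if s >= score_threshold}

if __name__ == "__main__":
    # Hypothetical sites with made-up scores, for illustration only.
    scores = {"chr1:1000": 0.3, "chr1:2500": 1.4, "chr2:700": -0.2}
    for cutoff in (0.01, 1):
        kept = filter_sites(scores, cutoff)
        print(cutoff, round(fold_cutoff(cutoff), 3), sorted(kept))
```

A cutoff of 0.01 corresponds to a fold difference barely above 1, so it removes only sites that score no better in DART than in the control.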

In any case, we have not used this score in our papers and I can't recommend any specific number.

Hopefully this answers your question.

ckuenne commented 1 year ago

that is very helpful, thanks!