Closed algaebrown closed 3 months ago
It strongly depends on what your objective is and how much data you have. I would recommend excluding all but high-quality sites from the re-calibration procedure, which is always the right step. Then it depends. If you are using population data (gnomAD, UKBB, ...), I would exclude selected variants: coding sites, and maybe sites with substantial non-coding constraint (high PhyloP, high PhastCons). For de novo data in the context of some condition, I would exclude mutations that could be associated with that condition (e.g. missense and nonsense mutations), but possibly also other categories that you suspect.
I hope this makes sense. If not, let me know.
Hi Vladimir,
Thanks for the timely reply. I study RNA-binding proteins (RBPs). I am doing population-level analyses, using gnomAD mostly.
I've been debating over the following options:
(1) using intronic regions with no RBP binding sites and no TF binding sites, with the low-quality and SFS filters applied, to estimate the scaling rate; (2) using intergenic regions: everything minus genes minus TF binding sites.
Which of the above do you think makes more sense?
Honestly, both options seem good to me and should give nearly identical results. I would lean toward introns because you will be studying transcribed regions, but again, the results should be close.
I see! Thank you!
Hi Vladimir,
Thanks for making this awesome model. I am very impressed and interested in the results.
In the README, you mention two types of background: synonymous mutations and non-coding ones.
https://github.com/vseplyarskiy/Roulette/tree/main/adding_mutation_rate
In the paper, you excluded TFBS. I wonder, if I were to exclude a set of sites from the non-coding background, how would I do it?
Thank you!
Charlene