Closed algaebrown closed 3 months ago
It strongly depends on what your objective is and how much data you have. I would recommend excluding all but high-quality sites from the re-calibration procedure, which is always the right step. Then it depends. If you are using population data (gnomAD, UKBB, ...), I would exclude selected variants: coding sites, and maybe sites with substantial non-coding constraint (high PhyloP, high PhastCons). For de novo data in the context of some condition, I would exclude mutations that could be associated with that condition (e.g. missense and nonsense mutations), but possibly also other categories that you suspect.
I hope this makes sense. If not, let me know.
Hi Vladimir,
Thanks for the timely reply. I study RNA-binding proteins (RBPs). I am doing population-level analyses, using gnomAD mostly.
I've been debating over the following options:
(1) using intronic regions with no RBP binding sites and no TF binding sites, with the low-quality and SFS filters applied, to estimate the scaling rate; (2) using intergenic regions: everything minus genes minus TF binding sites.
Which of the above do you think makes more sense?
Honestly, both options seem good to me and should give nearly identical results. I would lean toward introns because you will be studying transcribed regions, but again, the results should be close.
I see! Thank you!
Hi Vladimir,
Thanks for making this awesome model. I am very impressed and interested in the results.
In the README, you mention two types of background: synonymous mutations and non-coding ones.
https://github.com/vseplyarskiy/Roulette/tree/main/adding_mutation_rate
In the paper, you excluded TFBS. I wonder, if I were to exclude a set of sites from the non-coding background, how would I do it?
Thank you!
Charlene