mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License

Statistics of the results #6

Closed mcortes-lopez closed 1 year ago

mcortes-lopez commented 2 years ago

I am trying to create a summary table with the site, the total reads from the control and YTH samples, and the score. The goal is to provide a resource highlighting the highest-confidence sites. However, the output files are a bit confusing to interpret, given the multiple collapsing and filtering steps. Is any multiple-testing correction applied to the comparisons? If I understand correctly, each replicate is compared against each control and the results are then collapsed. What exactly does the final score reflect? Also, wouldn't a multiple-testing correction be required for all these comparisons?

mflamand commented 2 years ago

Hi, the current Bullseye pipeline is designed to identify sites found in multiple replicates by comparing each sample to each negative control. After running summarize_site.pl, the output corresponds to the average C2U editing rate across all samples where the site was called (and the average background editing rate in the control files). The coverage and number of mutations are summed across all samples where the site was called.

If the mutation rate in any control file (YTHmut, Mettl3 KO, etc.) is higher than in the DART experiment, the site is not called in that sample. No statistical test is involved here, so there is no multiple-testing correction.
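As a rough illustration, the per-site aggregation and control filter described above can be sketched as follows. This is a toy Python sketch with hypothetical counts, not the actual Perl implementation: the site is kept in a sample only when its editing rate beats the paired control, rates are averaged over the samples where the site was called, and coverage/mutation counts are summed.

```python
# Toy sketch of the per-site summarization logic (hypothetical data;
# the real logic lives in summarize_site.pl).
def summarize_site(samples, controls):
    """samples/controls: lists of (mutations, coverage) pairs, one per replicate."""
    called = []
    for (mut, cov), (ctrl_mut, ctrl_cov) in zip(samples, controls):
        rate = mut / cov
        ctrl_rate = ctrl_mut / ctrl_cov
        if rate > ctrl_rate:  # site only called if it beats the control
            called.append((mut, cov, rate))
    if not called:
        return None  # site not called in any sample
    return {
        "mutations": sum(m for m, _, _ in called),   # summed counts
        "coverage": sum(c for _, c, _ in called),    # summed coverage
        "rate": sum(r for _, _, r in called) / len(called),  # averaged rate
    }

# three replicates; the third fails the control filter (0.01 <= 0.02)
print(summarize_site([(10, 100), (5, 100), (1, 100)],
                     [(1, 100), (1, 100), (2, 100)]))
```

Here the third replicate is dropped, so the summary reflects only the first two samples.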

If you want more granular control over the coverage, number of mutations and editing rates, I have added a set of additional scripts in the /Code/quant/ directory. These can be used as follows:

1) Identify a list of common editing sites as done previously.

2) Run quantify_sites.pl on each matrix.gz file (including the control files) together with the Bullseye output file containing all sites:


perl quantify_sites.pl --bed bullseye_output.bed \
--EditedMatrix file.matrix.gz \
--outfile file1.out 

3) This will generate an output file containing, for each called site, the editing rate, coverage and number of mutations in that sample. These data can be gathered into a matrix table, where each row is a site and each column a sample, with gather_sites.pl:


perl gather_sites.pl --outfile results.out -cov -mut -score *.out 

providing all files generated by quantify_sites.pl as input.

Using each of -cov, -mut and -score will generate a distinct output file, named results.coverage.txt, results.mut.txt and results.score.txt respectively. Not all three options need to be used.

I usually use the -cov and -mut options to generate matrices of coverage and mutations, then import these into R to calculate averages across all samples (DART or DARTmut, etc.). To compare editing rates across conditions, I fit the editing rates at each site in a generalized linear model in R (quasibinomial model), extract significance with a Wald test, and adjust for multiple comparisons with the FDR method.

Let me know if you want more detail on this pipeline.

mcortes-lopez commented 2 years ago

Hi,

Thanks a lot for your reply! That last part with the GLM is exactly what I wanted to try with my data. I would appreciate it if you could share some scripts or details on that.

Thank you!

mflamand commented 2 years ago

Hi Mariela,

I added an Rmarkdown file and the R functions I use for the GLM to example_data/markdown. You will need to generate the files mentioned in my previous reply (the coverage and mutation matrices). I don't have great documentation for these R functions yet; I will add it later. Hopefully you can follow them. Let me know if you have other questions. I will be away next week and will answer the following week.