gtrichard / deepStats

deepStats: a statistical toolbox for deeptools and genomic signals
GNU General Public License v3.0

bootstrap plots #16

Open LeilyR opened 4 years ago

LeilyR commented 4 years ago

Hi,

What is the width of the regions on the bootstrap plot? How are the intervals calculated? Also, is there a way to run a statistical test on the difference between 2 curves and return a p-value?

gtrichard commented 4 years ago

Hi Leily,

What do you mean by width of regions?

The intervals displayed on the plots are calculated by bootstrapping: it takes the deeptools computeMatrix output as input and draws rows of the matrix with replacement. After n bootstraps (draws with replacement), the 5% and 95% bounds of the bootstrapped (expected) distribution are extracted and displayed around the mean of the observed distribution.
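A minimal sketch of the bootstrap described above, not the actual deepStats code. It assumes the computeMatrix output has already been loaded as a list of rows (one row per region, one column per bin) and that each bootstrap draws as many rows as the matrix contains. The names `bootstrap_ci` and `percentile` are hypothetical helpers introduced for illustration.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    frac = pos - low
    return sorted_values[low] * (1 - frac) + sorted_values[high] * frac


def bootstrap_ci(matrix, n_boot=1000, lower=5.0, upper=95.0, seed=0):
    """Return (observed_mean, lower_bound, upper_bound), one value per bin."""
    rng = random.Random(seed)
    n_rows, n_bins = len(matrix), len(matrix[0])
    boot_means = [[] for _ in range(n_bins)]
    for _ in range(n_boot):
        # Draw n_rows rows with replacement (assumption: same size as the input).
        sample = rng.choices(matrix, k=n_rows)
        for j in range(n_bins):
            boot_means[j].append(sum(row[j] for row in sample) / n_rows)
    # Mean of the observed (non-resampled) matrix, per bin.
    observed = [sum(row[j] for row in matrix) / n_rows for j in range(n_bins)]
    lo, hi = [], []
    for col in boot_means:
        col.sort()
        lo.append(percentile(col, lower))
        hi.append(percentile(col, upper))
    return observed, lo, hi
```

Calling `bootstrap_ci(matrix)` yields three lists of equal length, which can be drawn as a mean curve with a shaded band between the lower and upper bounds.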

For the statistics between 2 curves, there are two ways to perform the calculation: either we take the local minimum (the p-value of the most significant bin), or we aggregate the p-values of all bins or of selected bins. I have code for this that I still need to integrate.

I plan to make a new module specifically for 2 curves comparison with both plots (rank-sum test and bootstraps) + a p-value (local minimum or aggregated for selected bins).
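The two options could be sketched as follows, assuming the per-bin p-values have already been computed and are all strictly positive. The thread does not say which aggregation is meant, so Fisher's method is used here purely as one illustrative choice. It assumes independent tests, which neighbouring bins usually are not, so treat the combined value as optimistic. Both function names are hypothetical.

```python
import math


def min_pvalue(pvalues):
    """Return (bin_index, p_value) of the most significant bin."""
    idx = min(range(len(pvalues)), key=pvalues.__getitem__)
    return idx, pvalues[idx]


def fisher_combined(pvalues):
    """Combine p-values with Fisher's method (assumes independent tests).

    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees of
    freedom under the null; for even degrees of freedom the survival function
    has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total
```

To restrict the aggregation to certain bins, pass only those bins' p-values, for example `fisher_combined(pvalues[10:20])`.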

If you have some data to play with, I can prepare that during the week.

LeilyR commented 4 years ago

Hi Gautier,

Thanks for the answer. Those 5% and 95% bounds are what I meant: I didn't know which percentiles were chosen for the confidence interval. Also, about the bootstrapping: how does it choose the number of regions to resample? In my case I have two sets of regions, one with very few coordinates and the other with quite a lot. So it would be sufficient to bootstrap the larger set using the number of regions in the smaller set. However, I think it bootstraps both to some fixed number, right? I do have some data, and it would be great to have a chance to run it on my data.

Cheers!

gtrichard commented 4 years ago

Let's say you have 20 regions in one set and 400 regions in the other set.

The bootstrap will draw 20 rows with replacement from the first set and 400 rows with replacement from the second, and compute the per-column average each time. This is repeated n times, n being the number of bootstraps.

So the confidence interval will certainly be wider for the set with 20 regions.
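To illustrate the effect of set size, here is a small simulation on synthetic data (random Gaussian values, not real signal) for a single bin. Each set is resampled on its own, at its own size, and the width of the 5%-95% interval of the bootstrapped means is compared. `ci_width` is a hypothetical helper, and the percentiles are nearest-rank approximations.

```python
import random


def ci_width(values, n_boot=500, lower=5.0, upper=95.0, seed=0):
    """Width of the bootstrap interval of the mean for one bin."""
    rng = random.Random(seed)
    n = len(values)
    # Each bootstrap draws n values with replacement from this set only.
    boot_means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int(n_boot * lower / 100)]
    hi = boot_means[int(n_boot * upper / 100) - 1]
    return hi - lo


data_rng = random.Random(42)
small_set = [data_rng.gauss(0, 1) for _ in range(20)]   # 20 regions
large_set = [data_rng.gauss(0, 1) for _ in range(400)]  # 400 regions

print("width, 20 regions: ", ci_width(small_set))
print("width, 400 regions:", ci_width(large_set))
```

With the same underlying spread, the interval for the 20-region set comes out several times wider, since the standard error of the mean scales roughly with one over the square root of the number of regions.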

I've also started to implement another way of calculating p-values (instead of the Wilcoxon rank-sum test): compute z-scores from the bootstraps, then convert these z-scores to p-values, assuming the expected values are normally distributed. But it also needs testing...
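One possible reading of the z-score approach, for a single bin, is sketched below. The thread does not spell out the exact formula, so this is an assumption: the bootstrapped means of a reference set serve as the expected distribution, the observed mean of the other set is standardized against it, and a two-sided p-value is taken from the standard normal. `bootstrap_z_pvalue` is a hypothetical name.

```python
import random
from statistics import NormalDist, mean, stdev


def bootstrap_z_pvalue(observed, reference, n_boot=1000, seed=0):
    """Return (z, two-sided p) for an observed mean versus a bootstrapped reference.

    Assumes the bootstrapped means are roughly normal and not all identical
    (a zero standard deviation would make the z-score undefined).
    """
    rng = random.Random(seed)
    n = len(reference)
    boot_means = [mean(rng.choices(reference, k=n)) for _ in range(n_boot)]
    z = (observed - mean(boot_means)) / stdev(boot_means)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

Here `reference` holds one bin's values across the regions of one set, and `observed` is the mean of the same bin in the other set.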

I'll implement these changes ASAP and let you know!

About the CI, you can set the one you want with the --bootstrapsCI parameter.

LeilyR commented 4 years ago

Does it bootstrap 20 and 400 regions from the total pool of regions, regardless of which group they belong to?