slzhao / RnaSeqSampleSize

1 stars 2 forks source link

Number of prognistic genes in RnaSeqSampleSize #2

Closed matmu closed 2 years ago

matmu commented 3 years ago

I am using the est_power_distribution() function RnaSeqSampleSize for estimating the power for different sample sizes for an RNAseq experiment. What is not clear to me is on what basis to chose m1 which is the "expected number of prognostic genes". Is it the number of significant genes I identified in my own dataset? For the example below, the power increases when I set a higher m1, e.g. 3 or 4. Why is it like that?


# Estimate the gene read count and dispersion distribution
dataMatrixDistribution = est_count_dispersion(counts, group=groups)

# Power estimation by read count and dispersion distribution
est_power_distribution(
            n = 20, # Number of samples in each group
            m = 12, # Total number of genes for testing
            m1 =2, # Expected number of prognostic genes
            f = 0.05, # FDR level
            rho = 2, # Minimum fold changes for prognostic genes between two groups
            repNumber = 100, 
            minAveCount = 5, # Minimal average read count for each gene. Genes with smaller read counts will not be used
            selectedGenes = selected_genes, # 11 genes
            distributionObject = dataMatrixDistribution)
slzhao commented 2 years ago

Hello matmu,

m1 is expected number of differential genes. You can get it by experience (for example, cell line getting less than 1% diff genes and human tissue samples getting more than 1% diff genes), Or doing differential detection in your prior data.

For power changes with m1 changes, it is related to FDR. For example, if you want to control 1% FDR, if you have 10,000 genes in total and 1% differential genes and 80% power, you can identify 80 true differential genes, or if you have 10% differential genes and 80% power, you can identify 800 true differential genes. To get the same FDR (1%), you will need less false positive genes with 80 true differential genes than with 800 true differential genes. So you will need smaller alpha to control type 1 error (false positive genes), which means you have less power with 1% differential genes than 10% differential genes.