Open stanleyjs opened 1 year ago
This is a great question and there may well be different views on this.
Suppose you are assessing the prioritization from algorithm X. I would include in the background gene set any gene that had a chance to be considered by algorithm X. For example, this can be the expressed gene set that remains after single-cell QC and enters algorithm X.
In many studies, I think people just fall back to using the entire proteome since it is easier to implement, and it often errs on the 'conservative' side, since the reported enrichment would often become more significant if you restricted the background to a smaller set.
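To make the effect of the background concrete, here is a minimal sketch of the one-sided hypergeometric test that typically underlies a GO enrichment p-value, evaluated for one term against two backgrounds. The function name `enrichment_pvalue` and every count below are made up for illustration; they do not come from a real data set or from any particular tool.

```python
from math import comb


def enrichment_pvalue(pop_size, pop_hits, study_size, study_hits):
    """One-sided hypergeometric p-value: P(X >= study_hits).

    pop_size   -- number of genes in the background (population) set
    pop_hits   -- background genes annotated to the GO term
    study_size -- number of genes in the study set
    study_hits -- study genes annotated to the GO term
    """
    max_hits = min(pop_hits, study_size)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical study set: 50 genes, 10 of them annotated to the term.
study_size, study_hits = 50, 10

# Hypothetical restricted background: genes that entered the algorithm after QC.
p_restricted = enrichment_pvalue(2000, 60, study_size, study_hits)

# Hypothetical full background: all annotated protein-coding genes.
p_full = enrichment_pvalue(20000, 1200, study_size, study_hits)

print(f"restricted background: p = {p_restricted:.3g}")
print(f"full background:       p = {p_full:.3g}")
```

With these particular counts the restricted background gives the smaller p-value, but that is a property of the numbers chosen. In general the direction and size of the change depend on how the term's annotated genes are split between the two backgrounds.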
I agree. Which gene population background to use for gene ontology enrichment analyses (GOEAs) is a fantastic question, and one we just faced ourselves while analyzing a single-cell RNA data set for a grant application.
THE scRNA STUDY SET: We selected our gene study sets from the 1000 brain genes analyzed in single cell RNA expression results using pvalue<0.01 and min_logfold>2 when comparing the RNA expression of diseased astrocytes and healthy astrocytes.
THE scRNA POPULATION SET: The population gene set for the scRNA analysis was the 1000 genes examined in the scRNA expression analysis.
THE GOEA STUDY SET: The gene study sets for the GOEAs were the same as the scRNA study sets.
THE GOEA POPULATION SET: We used two population gene sets for the GOEA:
Both population sets are valid. We would report both results:
Both are interesting results and worth considering. Please note that we kept two steps separate: generating the study set from the scRNA expression results, and using that study set in the GOEAs.
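The study-set selection described above can be sketched roughly as follows. This is only an illustration: the gene names, the `de_results` dictionary layout, and the choice to apply the log fold-change cutoff without taking an absolute value are all assumptions, not the actual pipeline we ran.

```python
# Hypothetical differential-expression results: gene -> (p-value, log fold change).
de_results = {
    "GENE_A": (0.001, 2.5),
    "GENE_B": (0.200, 3.1),
    "GENE_C": (0.005, 0.4),
    "GENE_D": (0.0004, 2.2),
}

PVAL_MAX = 0.01     # p-value cutoff used on the expression results
MIN_LOGFOLD = 2.0   # minimum log fold change (assumed to be signed, not absolute)

# Population set: every gene examined in the expression analysis.
population = set(de_results)

# Study set: genes passing both thresholds.
study = {
    gene
    for gene, (pval, logfold) in de_results.items()
    if pval < PVAL_MAX and logfold > MIN_LOGFOLD
}

# The study set is a subset of the population by construction.
assert study <= population
print(sorted(study))
```

The same `study` set can then be passed to an enrichment analysis together with whichever population set is being tested, which keeps study-set generation and the GOEA itself as separate steps.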
One more note: You may be wondering why we used a pval of 0.05 for the GOEAs rather than the pval of 0.01 that we used on the scRNA expression analyses.
When writing the GOA Tools paper, I designed, architected and implemented a GOEA simulation and then interpreted the data in a GOA Tools simulation repo and found that:
Specificity: (are the GO terms that are declared significant correct?) was very good: almost all terms declared significant were truly significant, with only a tiny number of false positives.
Sensitivity: (did we miss finding any GO terms that are truly significant) suffered depending on:
To sum it up, the conclusion from analyzing the GOEA simulations is that GO terms found significant are likely to be correct and it is also likely that the GOEA may have missed a bunch of truly significant GO terms.
From this, I concluded pval<0.05 for the GOEA analyses was sufficient. Also, this is for preliminary research so we wanted to see as much as possible.
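For readers who want to score their own simulation, here is a small sketch of how sensitivity and specificity can be computed once the truly significant terms are known. It uses the standard textbook definitions, which may not match exactly how the simulation in the manuscript scored results. The function name, the term labels, and the counts are hypothetical.

```python
def sensitivity_specificity(all_terms, truly_significant, declared_significant):
    """Return (sensitivity, specificity) using the standard definitions.

    sensitivity = TP / (TP + FN): share of truly significant terms that were found
    specificity = TN / (TN + FP): share of non-significant terms correctly left out
    """
    all_terms = set(all_terms)
    truth = set(truly_significant)
    declared = set(declared_significant)

    tp = len(truth & declared)
    fn = len(truth - declared)
    fp = len(declared - truth)
    tn = len(all_terms - truth - declared)

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical simulation outcome: 10 tested terms, 4 truly significant,
# 2 of those 4 recovered and no false positives.
all_terms = [f"T{i}" for i in range(1, 11)]
truly_significant = {"T1", "T2", "T3", "T4"}
declared_significant = {"T1", "T2"}

sens, spec = sensitivity_specificity(all_terms, truly_significant, declared_significant)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

This made-up outcome has the same shape as the conclusion above: everything declared significant is correct, while half of the truly significant terms are missed.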
Here is the figure showing the simulation results from our GOA Tools manuscript:
I am looking to do some GO analysis on scRNA-seq data. We are validating an algorithm that produces interesting gene sets.
What is the best practice for constructing the background gene set? Should this just be all genes that were sequenced? Or all genes that were identified according to some threshold of the data (e.g., expressed in > 5 cells)? Or all genes that were included in the downstream analysis (e.g., for an analysis of subsets obtained from 2000 highly variable genes, the background is the 2000 HVGs)?