tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

Different results when restricting background set to only include genes with at least one annotation term #259

Open Maryam-Haghani opened 1 year ago

Maryam-Haghani commented 1 year ago

Hi,

I performed the enrichment analysis using my background set, then repeated it with the background set restricted to genes that have at least one annotation term (based on the annotation file used in the analysis).

I realized that GOATOOLS takes into account all background set genes without considering whether or not each gene has an annotation term. As a result, the findings of these two analyses had different P-values and GO term significance levels.

I'm wondering why GOATOOLS does not apply this filter by default in order to produce a more accurate enrichment study.

Thanks!

dvklopfenstein commented 1 year ago

Changing the background population genes will most likely result in different p-values than using the original background population genes. This is correct behavior.

If the background population genes are reduced by removing unannotated genes, the same should be done with the study genes.

Even when both the population and the study gene sets are reduced, the p-values will still likely differ from those obtained without removing any genes. By random chance, the proportion of unannotated genes in the total population and the proportion of unannotated genes in the study set will differ from gene set to gene set. This is expected behavior.
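To make the effect concrete, here is a minimal sketch of a one-sided hypergeometric enrichment p-value computed with the standard library only. The function name `enrichment_pvalue` and all counts are made up for illustration, and the test shown is not necessarily configured the same way as the test GOATOOLS runs. It only shows that the p-value depends on the population and study sizes, so removing unannotated genes from both sets changes the result even though the number of genes annotated to the GO term stays the same.

```python
from math import comb


def enrichment_pvalue(study_hits, study_size, pop_hits, pop_size):
    """One-sided P(X >= study_hits) under the hypergeometric distribution.

    study_hits: study genes annotated to the GO term
    study_size: total genes in the study set
    pop_hits:   population genes annotated to the GO term
    pop_size:   total genes in the population (background) set
    """
    upper = min(study_size, pop_hits)
    # Count the ways to draw a study set with at least `study_hits` term genes.
    favorable = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return favorable / comb(pop_size, study_size)


# Hypothetical counts: the GO term annotates 300 population genes, 15 of them in the study.
# Unfiltered: 20,000 population genes and 200 study genes.
p_all = enrichment_pvalue(15, 200, 300, 20000)

# Filtered: assume 8,000 population genes and 50 study genes had no annotation at all.
p_annotated = enrichment_pvalue(15, 150, 300, 12000)

print(f"all genes:       {p_all:.3e}")
print(f"annotated only:  {p_annotated:.3e}")
```

Unannotated genes can never be hits for any GO term, so dropping them shrinks `study_size` and `pop_size` while leaving `study_hits` and `pop_hits` unchanged, which is why the two p-values differ.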

GOATOOLS keeps all study genes and population genes by default. However, researchers wishing to develop criteria for removing population genes are able to do so, because the GOATOOLS architecture separates managing the databases (GO ontology DAG and annotations) from running the statistical tests.

Please feel free to apply any filtering functions on the population genes, but also ensure the same filter is applied to the study gene sets.
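As a rough sketch of what "the same filter on both sets" can look like, the snippet below assumes the annotations are already loaded as a plain dict mapping each gene ID to a set of GO IDs. The helper `keep_annotated`, the variable names, and the gene IDs are all hypothetical and are not part of the GOATOOLS API.

```python
def keep_annotated(genes, gene2gos):
    """Return only the genes that have at least one GO annotation."""
    # Genes missing from the dict, or mapped to an empty set, are dropped.
    return {gene for gene in genes if gene2gos.get(gene)}


# Toy annotations (assumed shape: gene ID -> set of GO IDs).
gene2gos = {
    "geneA": {"GO:0008150"},
    "geneB": {"GO:0003674", "GO:0005575"},
    "geneC": set(),  # listed in the annotations, but with no GO terms
}

population = {"geneA", "geneB", "geneC", "geneD"}
study = {"geneA", "geneD"}

# Apply the identical filter to both gene sets before running the enrichment.
population_annotated = keep_annotated(population, gene2gos)
study_annotated = keep_annotated(study, gene2gos)
```

Using one function for both sets keeps the filtering criterion identical, so the study set remains a subset of the population after filtering.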