tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
783 stars 210 forks source link

p_value calculation with fisher_exact #262

Closed Maryam-Haghani closed 1 year ago

Maryam-Haghani commented 1 year ago

Hi,

I did an analysis based on GOATOOLS and got the significant GO-terms along with their corresponding uncorrected p_values. But, when I tried to compute the uncorrelated p_values of GO terms manually by the definition: The probability of seeing in such a randomly drawn set at least as many term-t genes as in the study set, that can be easily calculated as the upper tail of a hyper-geometric distribution, I get different results from GOATOOLS's. Searching through your code at goatools/pvalcalc.py, I noticed that you have used scipy.fisher_exact with default 'two-sided' alternative to compute uncorrected p-values. But, from my understanding of the p_value formula and how fisher_exact works to compute p_values, 'greater' should be used as the alternative for fisher_exact.

Also, from GOATOOLS analysis, I got some significant go terms with no corresponding study items (study_count = 0) having a small p_value; But by the definition of p-value, these GO terms should have p_value = 1. I tested computing p_values for these GO terms with no corresponding study_count using hypergeom.pmf method, as well as _scipy.fisherexact with 'greater' alternative, and got p_value = 1 But when using fisher_exact with 'two-sided' alternative parameter, got the same result as GOATOOL's.

p_value_hypergeom = hypergeom.pmf(np.arange(study_count, min(study_pop, pop_count)+1), pop_n, study_n, pop_count).sum()
# when study_count = 0 -> p_value_hypergeom =0

odd_ratio, fisher_p_value = stats.fisher_exact([[study_count, study_pop - study_count], [pop_count- study_count, pop_n- pop_count- ( study_n - study_count)]], alternative='greater')
# when study_count = 0 -> fisher_p_value =0

odd_ratio, fisher_p_value = stats.fisher_exact([[study_count, study_pop - study_count], [pop_count- study_count, pop_n- pop_count- ( study_n - study_count)]], alternative='two-sided')
# when study_count = 0 -> fisher_p_value = a small value, the same as GOATOOLS's result

I would appreciate it if you clarify that for me.

tanghaibao commented 1 year ago

@Maryam-Haghani

We opted for two-sided tests since we are actually testing both directions. This is a typical report from goatools.

...
      10 GO terms found significant (< 0.05=alpha) (  8 enriched +   2 purified): statsmodels fdr_bh
      57 study items associated with significant GO IDs (enriched)
       1 study items associated with significant GO IDs (purified)

Note that we record both "enriched" and "purified". We test that the proportions in the two groups are different, the directionality is then determined after the test.

Would your use case require testing one-sided? or would it be sufficient that you post-process to get only the "enriched" set?

Thanks for digging into this.

Maryam-Haghani commented 1 year ago

@tanghaibao Thank you for your answer. I got the point now! I will filter the results to get only enriched terms.