tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

Low pVal/FDR #104

Open phillipeloher opened 6 years ago

phillipeloher commented 6 years ago

I'm getting a really low FDR/pVal for GO:0004930, where only 1 gene matches. I'm nervous that either something is wrong or I'm passing in the wrong thing. Or could it be that matching children contribute to the significance even though the children aren't showing up? For what it's worth, DAVID is not surfacing this GO term. Thanks for any guidance.

goea_results_sig[3].pop_n 20913

goea_results_sig[3].study_count 1

goea_results_sig[3].study_n 500

goea_results_sig[3].study_items {1901}

goea_results_sig[3].pop_count 662

goea_results_sig[3].p_uncorrected 3.244847272457287e-06

goea_results_sig[3].p_fdr_bh 0.014270838304267149

goea_results_sig[3].goterm
GOTerm('GO:0004930'):
  name:G-protein coupled receptor activity
  depth:4
  id:GO:0004930
  is_obsolete:False
  parents: 1 items
    GO:0004888 level-03 depth-03 transmembrane signaling receptor activity [molecular_function]
  level:4
  alt_ids: 5 items
    GO:0001622
    GO:0016526
    GO:0001625
    GO:0001623
    GO:0001624
  _parents: 1 items
    GO:0004888
  namespace:molecular_function
  children: 41 items

phillipeloher commented 6 years ago

Forgot to mention, this is for human (taxonomy 9606)

dvklopfenstein commented 6 years ago

Dr. Loher,

Thank you for taking the time to write to us regarding this. We have another open issue with the same question regarding large differences between uncorrected P-values and the corrected values.

We have stochastic simulation code to investigate sensitivity/specificity/FDR performances of GOEA runs over a variety of study sizes and percentage of True Nulls.
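To make the idea concrete, here is a minimal sketch of that kind of simulation. It is not the GOATOOLS simulation code: the function names, the way alternative p-values are drawn (uniform on [0, 1e-3]), and the default parameters are all assumptions chosen for illustration. It draws p-values for a mix of true nulls and true effects, applies a Benjamini-Hochberg adjustment, and reports the realized FDR and sensitivity.

```python
import random

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def simulate(num_tests=1000, frac_true_null=0.9, alpha=0.05, seed=0):
    """One simulated run; returns (realized FDR, sensitivity)."""
    rng = random.Random(seed)
    num_null = round(num_tests * frac_true_null)
    is_null = [True] * num_null + [False] * (num_tests - num_null)
    # Assumption: true nulls give uniform p-values; true effects give
    # p-values uniform on [0, 1e-3]. Real effect sizes will differ.
    pvals = [rng.random() if null else rng.random() * 1e-3 for null in is_null]
    called = [adj <= alpha for adj in benjamini_hochberg(pvals)]
    false_pos = sum(c and n for c, n in zip(called, is_null))
    true_pos = sum(c and not n for c, n in zip(called, is_null))
    num_called = sum(called)
    fdr = false_pos / num_called if num_called else 0.0
    num_alt = num_tests - num_null
    sensitivity = true_pos / num_alt if num_alt else float("nan")
    return fdr, sensitivity

for frac in (0.5, 0.9, 0.99):
    fdr, sens = simulate(frac_true_null=frac)
    print(f"true nulls {frac:.0%}: FDR={fdr:.3f} sensitivity={sens:.3f}")
```

Varying `num_tests` and `frac_true_null` shows how the gap between uncorrected and corrected values depends on how many tests are run and how many of them are true nulls.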

I would like to set up a simulation to study the situation that you and the other researcher are seeing. But I will need more information. Is it possible to supply us with the full set of files to recreate the GOEA run?

We follow recommended methodologies for using the statistics functions. Our statistical analysis is quite "vanilla", so it is likely that all is fine. But you have an interesting question whose answer can impact GOATOOLS, so a closer look is warranted.

Thank you for taking the time to let us know your concerns and thank you for your interest in GOATOOLS.

phillipeloher commented 6 years ago

Thanks for the fast response, and apologies for the delay. Even the uncorrected pVal seems too low to me, so I don't think the issue is necessarily with the corrected values. Of the 500 genes passed in, only n=1 (gene name: S1PR1) intersects with GO:0004930. That one gene led to an uncorrected pVal in the Xe-06 range. If I'm reading it correctly, GO:0004930 has n=662 members within a background of n=20913.
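One way to sanity-check these numbers is to work through the hypergeometric arithmetic by hand. The sketch below uses only the four counts quoted in this thread and the Python standard library. It assumes the test is a Fisher's exact (hypergeometric) test and that the two-sided p-value sums all outcomes no more likely than the observed one; whether that matches what the GOEA run actually did is an assumption, not something confirmed in this thread.

```python
from math import comb

# Counts reported in this thread for GO:0004930.
pop_n, pop_count = 20913, 662    # background genes; background genes with the term
study_n, study_count = 500, 1    # study genes; study genes with the term

def hypergeom_pmf(k):
    """P(exactly k of the study_n sampled genes carry the term)."""
    return (comb(pop_count, k) * comb(pop_n - pop_count, study_n - k)
            / comb(pop_n, study_n))

# Number of study genes expected to carry the term by chance alone.
expected = study_n * pop_count / pop_n

pmf = [hypergeom_pmf(k) for k in range(min(study_n, pop_count) + 1)]
p_obs = pmf[study_count]

p_lower = sum(pmf[:study_count + 1])   # P(X <= 1): under-representation tail
p_upper = sum(pmf[study_count:])       # P(X >= 1): over-representation tail
# Two-sided: every outcome at most as likely as the observed one
# (small relative tolerance guards against floating-point ties).
p_two_sided = sum(p for p in pmf if p <= p_obs * (1 + 1e-7))

print(f"expected by chance: {expected:.2f}")
print(f"P(X <= {study_count}) = {p_lower:.3g}")
print(f"P(X >= {study_count}) = {p_upper:.3g}")
print(f"two-sided          = {p_two_sided:.3g}")
```

With 662 of 20913 background genes annotated, a random set of 500 genes would be expected to contain roughly 16 of them, so a single hit is far fewer than chance would give. Under the assumptions above, the small tail is the under-representation one, which would mean the term is depleted rather than enriched in the study set. That would also be consistent with an enrichment-only tool not reporting it. It may be worth checking the `enrichment` field on the result record to see which direction was called.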

I would prefer to send sample-code and test input via email if that's possible. Please let me know if that would be helpful. Thanks again for your help in understanding this.

dvklopfenstein commented 6 years ago

Yes. Sending sample-code and test input via email would be very helpful. Please see my GitHub home page for my email.

It will be interesting to explore the large uncorrected/corrected pvalue difference.

Thank you for your interest in GOATOOLS and taking the time to write us regarding this interesting question.