tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
749 stars 212 forks source link

I got confusing results during GOEA #220

Open WinterFor opened 3 years ago

WinterFor commented 3 years ago

Hi, After #218, I found the macaque monkey annotation in gaf format from QuickGo. the macaque monkey annotation macaca_fascicularis_from_quick_go_annotations.gaf Download

def get_goeaobj_gaf(file_name, method = 'fdr_bh'):
    obo_fname = download_go_basic_obo()
    obodag = GODag("go-basic.obo")
    # Load and Read Associations,  Store annotations in a list of namedtuples
    ogaf = GafReader(file_name)
    # Get associations for each branch of the GO DAG (BP, MF, CC)
    ns2assoc = ogaf.get_ns2assc()
    # GOE Object holds Ontologies, Associations, and Background gene set
    return GOEnrichmentStudyNS(
        GeneID2nt_monkey.keys(), # Background gene set: mouse protein-coding genes
        ns2assoc, # geneid/GO Associations for BP, MF, anc CC GODAG branches
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = [method]) # defult multipletest correction method

goeaobj = get_goeaobj_gaf('macaca_fascicularis_from_quick_go_annotations.gaf')

Then run without error. But the result is confusing.

  EXISTS: go-basic.obo
go-basic.obo: fmt(1.2) rel(2021-06-16) 47,230 GO Terms
HMS:0:00:03.772319 184,000 annotations READ: macaca_fascicularis_from_quick_go_annotations.gaf 

Load BP Gene Ontology Analysis ...
Propagating term counts up: is_a
  0%      0 of 20,630 population items found in association

Load CC Gene Ontology Analysis ...
Propagating term counts up: is_a
  0%      0 of 20,630 population items found in association

Load MF Gene Ontology Analysis ...
Propagating term counts up: is_a
  0%      0 of 20,630 population items found in association

Why are all 0%? Is this a reasonable result? Thanks!

dvklopfenstein commented 3 years ago

Hello @WinterFor,

It is great that you found annotations for the macaque monkey. What do the names look like in your population file?

WinterFor commented 3 years ago

Hi, @dvklopfenstein

Do you mean the population file refers to the file containing background genes from NCBI?

I get this file use query "9541"[Taxonomy ID] AND alive[property] AND genetype protein coding[Properties] link

And I import it by from genes_ncbi_9541_proteincoding import GENEID2NT as GeneID2nt_monkey


print(list(GeneID2nt_monkey.items())[:10])
[(7857739, ntncbi(tax_id=9541, Org_name='Macaca fascicularis', GeneID=7857739, CurrentID=0, Status='live', Symbol='COX1', Aliases=['KEG98_p11'], description='cytochrome c oxidase subunit I', other_designations='cytochrome c oxidase subunit I', map_location='', chromosome='MT', genomic_nucleotide_accession_version='NC_012670.1', start_position_on_the_genomic_accession=5858, end_position_on_the_genomic_accession=7396, orientation='plus', exon_count=0, OMIM=[], no_hdr0='')), 
(7857740, ntncbi(tax_id=9541, Org_name='Macaca fascicularis', GeneID=7857740, CurrentID=0, Status='live', Symbol='COX2', Aliases=['KEG98_p10'], description='cytochrome c oxidase subunit II', other_designations='cytochrome c oxidase subunit II', map_location='', chromosome='MT', genomic_nucleotide_accession_version='NC_012670.1', start_position_on_the_genomic_accession=7540, end_position_on_the_genomic_accession=8223, orientation='plus', exon_count=0, OMIM=[], no_hdr0='')),
(7857741, ntncbi(tax_id=9541, Org_name='Macaca fascicularis', GeneID=7857741, CurrentID=0, Status='live', Symbol='ATP8', Aliases=['KEG98_p09'], description='ATP synthase F0 subunit 8', other_designations='ATP synthase F0 subunit 8', map_location='', chromosome='MT', genomic_nucleotide_accession_version='NC_012670.1', start_position_on_the_genomic_accession=8364, end_position_on_the_genomic_accession=8570, orientation='plus', exon_count=0, OMIM=[], no_hdr0='')),
(7857742, ntncbi(tax_id=9541, Org_name='Macaca fascicularis', GeneID=7857742, CurrentID=0, Status='live', Symbol='ATP6', Aliases=['KEG98_p08'], description='ATP synthase F0 subunit 6', other_designations='ATP synthase F0 subunit 6', map_location='', chromosome='MT', genomic_nucleotide_accession_version='NC_012670.1', start_position_on_the_genomic_accession=8525, end_position_on_the_genomic_accession=9205, orientation='plus', exon_count=0, OMIM=[], no_hdr0='')),
(7857743, ntncbi(tax_id=9541, Org_name='Macaca fascicularis', GeneID=7857743, CurrentID=0, Status='live', Symbol='COX3', Aliases=['KEG98_p07'], description='cytochrome c oxidase subunit III', other_designations='cytochrome c oxidase subunit III', map_location='', chromosome='MT', genomic_nucleotide_accession_version='NC_012670.1', start_position_on_the_genomic_accession=9205, end_position_on_the_genomic_accession=9988, orientation='plus', exon_count=0, OMIM=[], no_hdr0=''))]

Do you mean the "names" refer to "Gene ID" in the background file?

dvklopfenstein commented 3 years ago

Hello @WinterFor ,

The gaf file contains IDs that are UniProt protein IDs such as A0A023JBW5.

But your population file contains NCBI GeneIDs. You will need to convert the NCBI GeneIDs such as 7857739 to protein IDs or the other way around. You could do this by writing a script which used the gene Symbol in GeneID2nt_monkey to match NCBI GeneIDs and then by using the DB_ID in the gaf file to match the UniProt protein IDs.

It is a bit of work to get your scripts set up for a species that is not ubiquitous in most GO databases. But once they are set up, then you can do Gene Ontology Enrichment Analyses on all kinds of species.

You are doing a great job with all this. I can't wait to read your papers on macaque monkies.

dvklopfenstein commented 3 years ago

I just want to add your QuickGO search for others that may need to search for species:

The QuickGO search was: https://www.ebi.ac.uk/QuickGO/annotations?downloadLimit=483104&taxonId=9541&taxonUsage=descendants

WinterFor commented 3 years ago

@dvklopfenstein I may find another time to finish this script. Thanks for your help all time.