tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
749 stars 212 forks source link

AssertionError: **FATAL: NO TAXIDS: gene2go, When I use `taxids = [9541]` #218

Closed WinterFor closed 3 years ago

WinterFor commented 3 years ago

Hi, I am using GOATOOLS for GOEA, and follow the pipline from goea_nbt3102_group_results.ipynb.

But I had some wrong with using taxids = [9541] in objanno = Gene2GoReader("gene2go", taxids=[9541]). I had tried 9606, 10090, and they can work.

9541 is the taxid of crab-eating monkey(Macaca Fascicularis)

It shows below

goeaobj = get_goeaobj('fdr_bh')
geneids_study = geneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]
---------------------------------------------------------------------------
  EXISTS: go-basic.obo
go-basic.obo: fmt(1.2) rel(2021-05-01) 47,284 GO Terms
  EXISTS: gene2go
HMS:0:00:01.558794       0 annotations,      0 genes,      0 GOs, 0 taxids READ: gene2go 
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-42-1ae2b6ad2b5a> in <module>
----> 1 goeaobj = get_goeaobj('fdr_bh')
      2 geneids_study = geneid2symbol.keys()
      3 goea_results_all = goeaobj.run_study(geneids_study)
      4 goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]

<ipython-input-41-347669f81ffd> in get_goeaobj(method)
     12     download_ncbi_associations() # Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
     13     # Read NCBI's gene2go. Store annotations in a list of namedtuples
---> 14     objanno = Gene2GoReader("gene2go", taxids=[9541])
     15     # Get associations for each branch of the GO DAG (BP, MF, CC)
     16     ns2assoc = objanno.get_ns2assc()

/home/user/miniconda3/lib/python3.7/site-packages/goatools/anno/genetogo_reader.py in __init__(self, filename, **kws)
     27         super(Gene2GoReader, self).__init__('gene2go', filename, **kws)
     28         # Each taxid has a list of namedtuples - one for each line in the annotations
---> 29         self.taxid2asscs = self._init_taxid2asscs()
     30 
     31     def get_ns2assc(self, taxid=None, **kws):

/home/user/miniconda3/lib/python3.7/site-packages/goatools/anno/genetogo_reader.py in _init_taxid2asscs(self)
    162         for ntanno in self.associations:
    163             taxid2asscs[ntanno.tax_id].append(ntanno)
--> 164         assert taxid2asscs, "**FATAL: NO TAXIDS: {F}".format(F=self.filename)
    165         return dict(taxid2asscs)
    166 

AssertionError: **FATAL: NO TAXIDS: gene2go

Thanks for your help!

enushi commented 3 years ago

I had similar issue. Try deleting the GO and annotation files and run the script again. It might be the case that they were not correctly downloaded due to server/internet problems.

WinterFor commented 3 years ago

I had similar issue. Try deleting the GO and annotation files and run the script again. It might be the case that they were not correctly downloaded due to server/internet problems.

@enushi Thanks for your advice, But It does not work for me. I deleted go-basic.obo and gene2go, and run the script again. It shows Error -3 while decompressing data: invalid stored block lengths.

requests.get(http://purl.obolibrary.org/obo/go/go-basic.obo, stream=True)
  WROTE: go-basic.obo

go-basic.obo: fmt(1.2) rel(2021-06-16) 47,230 GO Terms
FTP RETR ftp.ncbi.nlm.nih.gov gene/DATA gene2go.gz -> gene2go.gz
  gunzip gene2go.gz
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-11-1ae2b6ad2b5a> in <module>
----> 1 goeaobj = get_goeaobj('fdr_bh')
      2 geneids_study = geneid2symbol.keys()
      3 goea_results_all = goeaobj.run_study(geneids_study)
      4 goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]

<ipython-input-8-347669f81ffd> in get_goeaobj(method)
     10     obodag = GODag("go-basic.obo")
     11     # Load Associations
---> 12     download_ncbi_associations() # Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
     13     # Read NCBI's gene2go. Store annotations in a list of namedtuples
     14     objanno = Gene2GoReader("gene2go", taxids=[9541])

/home/user/miniconda3/lib/python3.7/site-packages/goatools/base.py in download_ncbi_associations(gene2go, prt, loading_bar)
    131         file_remote = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/{GZ}".format(
    132             GZ=os.path.basename(gzip_file))
--> 133         dnld_file(file_remote, gene2go, prt, loading_bar)
    134     else:
    135         if prt is not None:

/home/user/miniconda3/lib/python3.7/site-packages/goatools/base.py in dnld_file(src_ftp, dst_file, prt, loading_bar)
    221             if prt is not None:
    222                 prt.write("  gunzip {FILE}\n".format(FILE=dst_wget))
--> 223             gzip_open_to(dst_wget, dst_file)
    224     except IOError as errmsg:
    225         import traceback

/home/user/miniconda3/lib/python3.7/site-packages/goatools/base.py in gzip_open_to(fin_gz, fout)
    233     with gzip.open(fin_gz, 'rb') as zstrm:
    234         with  open(fout, 'wb') as ostrm:
--> 235             ostrm.write(zstrm.read())
    236     assert os.path.isfile(fout), "COULD NOT GUNZIP({G}) TO FILE({F})".format(G=fin_gz, F=fout)
    237     os.remove(fin_gz)

/home/user/miniconda3/lib/python3.7/gzip.py in read(self, size)
    285             import errno
    286             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287         return self._buffer.read(size)
    288 
    289     def read1(self, size=-1):

/home/user/miniconda3/lib/python3.7/gzip.py in read(self, size)
    480             buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    481 
--> 482             uncompress = self._decompressor.decompress(buf, size)
    483             if self._decompressor.unconsumed_tail != b"":
    484                 self._fp.prepend(self._decompressor.unconsumed_tail)

error: Error -3 while decompressing data: invalid stored block lengths

Then I try to manual download gene2go from NCBI download page. I run the script again, It shows AssertionError: **FATAL: NO TAXIDS: gene2go again.

dvklopfenstein commented 3 years ago

Hello @WinterFor,

Thank you for interest in GOA TOOLs and for taking the time to write to us. Thank you @enushi for contributing to helping in this issue, which is very much appreciated.

CONCLUSION: crab-eating macaque monkey (9541) is not in NCBI's gene2go annotation file.

ACTION ITEM: @WinterFor, can you find another potential source for the macaque monkey annotations? NCBI does not support taxid 9541 in the their gene2go annotations. When you find the annotations, they will likely not be in gene2go format, but may be in the gaf or gpad format. If the monkey annotations are not in any of those annotations, you can write a short script to convert into either our simple format or the gaf or gpad format. Please open a new issue if you need guidance about annotation conversions.

Details

I download the latest gene2go file:

$ rm -f gene2go*
$ wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
$ gunzip gene2go.gz

The first column contains the taxid. I see that there are 42 taxids (the 43 returned by the awk includes the taxid header):

$ awk '{print $1}' gene2go | sort | uniq -c | sort -r | wc -l
43

The top three taxid with the most annotations is: mouse (10090) with 414,017 annotations; rat(10116) with 389,923 annotations, and human(9606) with 341,883 annotations :

$ awk '{print $1}' gene2go | sort | uniq -c | sort -r
 414017 10090
 389023 10116
 341882 9606
...

Crab-eating macaque monkey (9541) is not in NCBI's gene2go annotation file as shown by the 1st awk command returning nothing. To test that I am running the command correctly, in the 2nd awk, I look for human (9606) and see the expected 341,883 annotations.

$ awk '{print $1}' gene2go | sort | uniq -c | grep 9541

$ awk '{print $1}' gene2go | sort | uniq -c | grep 9606
 341882 9606
tanghaibao commented 3 years ago

@dvklopfenstein awesome analysis. might be worth performing a check in code and display the appropriate error when taxid is not found

WinterFor commented 3 years ago

I will try to find the macaque monkey annotations. Thanks for your help! @dvklopfenstein @enushi

dvklopfenstein commented 3 years ago

@tanghaibao, I was thinking the same thing regarding improving the error message...