merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvi-run-ncbi-cogs fails to run because checksum file is missing #2111

Closed dspeth closed 11 months ago

dspeth commented 1 year ago

Short description of the problem

anvi-run-ncbi-cogs fails to run because checksum file is missing

anvi'o version

anvio-dev, installed following the instructions and updated every time the environment is updated

System info

unix life sciences cluster of the university of vienna

Detailed description of the issue

When running anvi-run-ncbi-cogs, I encountered this error:

Config Error: At least one essential formatted file that is necesary for COG operations is not where it should be
('/lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/CHECKSUMS.txt'). You
should run COG setup, with the flag --reset if necessary, to make sure things are in order.

After running anvi-setup-ncbi-cogs with the --reset flag, neither the original file "checksum.md5.txt" nor the formatted file "CHECKSUMS.txt" seems to be present.

meren commented 1 year ago

Hey @dspeth, this is a new feature we're testing and I'll ping @Ge0rges here, who implemented this feature just recently.

When I tried to run it with the following command,

anvi-setup-ncbi-cogs --cog-data-dir TEST

it ran without an issue. Can you please confirm that you ran it without specifying a --cog-data-dir flag (as most people will do)?

meren commented 1 year ago

anvi-setup-ncbi-cogs --reset

seems to be also working for me :/

It would help a lot if you could rerun the command with the --debug flag and send the entire output, @dspeth.

Thank you for your patience.

dspeth commented 1 year ago

@meren weird... I'm indeed running anvi-setup-ncbi-cogs --reset --debug without specifying a data dir

I did notice that it seems like the original checksum file gets removed by the last line of cogs.py, but otherwise don't know why the behaviour would be different for me than for you.

I could have thought of running it with --debug myself. Here's the debug output:

[speth@login01 ~]$ anvi-setup-ncbi-cogs --reset --debug
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG

WARNING
===============================================
This program will remove everything in the COG data directory, then download and
reformat everything from scratch.

DOWNLOADING FILE
===============================================
Source URL ...................................: ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
Output path ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

Downloaded successfully ......................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv                                                                               

DOWNLOADING FILE
===============================================
Source URL ...................................: ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.def.tab
Output path ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab

Downloaded successfully ......................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab                                                                               

DOWNLOADING FILE
===============================================
Source URL ...................................: ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data/fun-20.tab
Output path ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab

Downloaded successfully ......................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab                                                                                   

DOWNLOADING FILE
===============================================
Source URL ...................................: ftp://ftp.ncbi.nih.gov//pub/COG/COG2020/data/checksums.md5.txt
Output path ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/checksum.md5.txt

Downloaded successfully ......................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/checksum.md5.txt                                                                             

DOWNLOADING FILE
===============================================
Source URL ...................................: ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz
Output path ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz

Downloaded successfully ......................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz                                                                                 
Diamond log ..................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/DB_DIAMOND/log.txt                                                                                              

DIAMOND MAKEDB
===============================================

[DEBUG] `run_command` is running .............: diamond makedb --in /tmp/tmpuk9my6tb -d /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/DB_DIAMOND/COG -p 1

Diamond search DB ............................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/DB_DIAMOND/COG.dmnd                                                                                             
BLAST log ....................................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/DB_BLAST/log.txt

NCBI BLAST MAKEDB
===============================================

[DEBUG] `run_command` is running .............: makeblastdb -in /tmp/tmpuk9my6tb -dbtype prot -out /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/DB_BLAST/COG/COG.fa

BLAST search db ..............................: /tmp/tmpuk9my6tb

dspeth commented 1 year ago

@meren, it seems like I can rescue this by just putting an empty file named CHECKSUMS.txt in the COG dir. It is still unclear why this is not a problem for you.
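A minimal sketch of that workaround in Python, for anyone who prefers to script it. The helper name `ensure_placeholder_checksums` is hypothetical and not part of anvi'o. The sketch only creates an empty CHECKSUMS.txt in whatever directory it is given, and the demonstration uses a throwaway directory rather than a real COG data directory.

```python
import tempfile
from pathlib import Path


def ensure_placeholder_checksums(cog_data_dir):
    """Create an empty CHECKSUMS.txt in cog_data_dir if it is absent.

    Hypothetical helper (not part of anvi'o). An existing file is left
    untouched; the path to the file is returned either way.
    """
    path = Path(cog_data_dir) / "CHECKSUMS.txt"
    path.touch(exist_ok=True)
    return path


# Demonstrated on a temporary directory; on a real installation the argument
# would be the COG data directory reported in the error message.
with tempfile.TemporaryDirectory() as tmp:
    placeholder = ensure_placeholder_checksums(tmp)
    created = placeholder.exists()
    size = placeholder.stat().st_size
```

The transcript that follows shows the effect: with the empty file in place the run completes, and after removing it the error returns.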

[speth@login01 test]$ anvi-run-ncbi-cogs -c CO36386bin_9_anvio.db -T 8
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG
COG data directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20
Searching with ...............................: diamond
Directory to store temporary files ...........: /tmp/tmpcpq1soow
Directory will be removed after the run ......: True

DIAMOND BLASTP
===============================================
Additional params for blastp .................: 
Search results ...............................: /tmp/tmpcpq1soow/diamond-search-results.txt                                                                                                                             

DIAMOND VIEW
===============================================
Diamond  tabular output file .................: /tmp/tmpcpq1soow/diamond-search-results.txt                                                                                                                             
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG

WARNING
===============================================
Some functional annotation sources you wish to add to the database
(COG20_PATHWAY, COG20_FUNCTION, COG20_CATEGORY) are already in the database.
Anvi'o will first drop them so the incoming annotations could REPLACE them.

WARNING
===============================================
Dropping 3 functional annotation sources yo: COG20_PATHWAY, COG20_FUNCTION,
COG20_CATEGORY.

Gene functions ...............................: 7,883 function calls from 3 sources (COG20_PATHWAY, COG20_FUNCTION, COG20_CATEGORY) for 3,437 unique gene calls have been added to the contigs database.

✓ anvi-run-ncbi-cogs took 0:00:15.308512
[speth@login01 test]$ cd /lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/
[speth@login01 COG20]$ ls
total 38M
-rw-r--r-- 1 speth login  949 Aug 29 13:18 CATEGORIES.txt
-rw-r--r-- 1 speth login    0 Aug 29 13:31 CHECKSUMS.txt
-rw-r--r-- 1 speth login 391K Aug 29 13:18 COG.txt
drwxr-xr-x 3 speth login    4 Aug 29 13:19 DB_BLAST
drwxr-xr-x 2 speth login    4 Aug 29 13:18 DB_DIAMOND
-rw-r--r-- 1 speth login   72 Aug 29 13:20 MISSING_COG_IDs.cPickle
-rw-r--r-- 1 speth login  38M Aug 29 13:18 PID-TO-CID.cPickle
drwxr-xr-x 2 speth login    6 Aug 29 13:31 RAW_DATA_FROM_NCBI
[speth@login01 COG20]$ rm CHECKSUMS.txt 
[speth@login01 COG20]$ cd
[speth@login01 ~]$ cd test/
[speth@login01 test]$ anvi-run-ncbi-cogs -c CO36386bin_9_anvio.db -T 8
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG

Config Error: At least one essential formatted file that is necesary for COG operations is not
              where it should be                                                              
              ('/lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/CHECKSUMS.txt'). You  
              should run COG setup, with the flag `--reset` if necessary, to make sure things 
              are in order.                                                                   

[speth@login01 test]$ anvi-run-ncbi-cogs -c CO36386bin_9_anvio.db -T 8 --debug
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /lisc/user/speth/github/anvio/anvio/data/misc/COG

Traceback for debugging
#================================================================================
  File "/lisc/user/speth/github/anvio/bin/anvi-run-ncbi-cogs", line 65, in <module>
    main(args)
  File "/lisc/user/speth/github/anvio/anvio/terminal.py", line 881, in wrapper
    program_method(*args, **kwargs)
  File "/lisc/user/speth/github/anvio/bin/anvi-run-ncbi-cogs", line 35, in main
    cogs = COGs(args)
  File "/lisc/user/speth/github/anvio/anvio/cogs.py", line 78, in __init__
    self.initialize(args)
  File "/lisc/user/speth/github/anvio/anvio/cogs.py", line 91, in initialize
    self.essential_files = self.COG_setup.get_essential_file_paths()
  File "/lisc/user/speth/github/anvio/anvio/cogs.py", line 579, in get_essential_file_paths
    "are in order." % essential_files[file_name])
#================================================================================

Config Error: At least one essential formatted file that is necesary for COG operations is not
              where it should be                                                              
              ('/lisc/user/speth/github/anvio/anvio/data/misc/COG/COG20/CHECKSUMS.txt'). You  
              should run COG setup, with the flag `--reset` if necessary, to make sure things 
              are in order.

meren commented 1 year ago

I'm so sorry, @dspeth. I just realized why anvi-run-ncbi-cogs is not working :/ While CHECKSUMS.txt is marked as an essential file, it was removed after the setup and I missed that :/ https://github.com/merenlab/anvio/commit/73b4d98eded3d54f1082c812fa5eb0167c6fadd9 fixes it, but it will require you to re-run the command,

anvi-setup-ncbi-cogs --reset

My apologies. I hope this will solve the issue.

meren commented 1 year ago

I have a better solution that I'm testing at the moment, but here is the longer story of what is happening, and an explanation for why https://github.com/merenlab/anvio/commit/73b4d98eded3d54f1082c812fa5eb0167c6fadd9 is not the best way to solve this:

The current setup marks CHECKSUMS.txt as an essential file:

(...)
'checksum.md5.txt': {  # No func as it is called by the setup_raw_data function
     'url': 'ftp://ftp.ncbi.nih.gov//pub/COG/COG2020/data/checksums.md5.txt',
     'type': 'essential',
     'formatted_file_name': 'CHECKSUMS.txt'},
(...)

This means that any time someone runs anvi-run-ncbi-cogs, the COGs class gets an instance of the COGsSetup class, which tries to make sure all essential files are in place. It then complains that CHECKSUMS.txt is not there and asks people to re-run anvi-setup-ncbi-cogs (even though it is not necessary for them to do so). The funny thing is, even if they did run anvi-setup-ncbi-cogs, the code was going to remove CHECKSUMS.txt, and they were going to be missing this essential file anyway.

Commit https://github.com/merenlab/anvio/commit/73b4d98eded3d54f1082c812fa5eb0167c6fadd9 solved the latter issue, but I'm now realizing that the best solution is to actually mark CHECKSUMS.txt as a non-essential file. This way anyone who already has a working COG setup (like Daan) will continue to use it seamlessly, and the checksum change will only affect people who are just starting to use anvio-dev. I've made that change now with https://github.com/merenlab/anvio/commit/dc0de1ef34e9a5c3243f9c41258d98d2b30c802a, and I think we should be fine :)
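To illustrate the difference, here is a minimal, self-contained sketch of an essential-file check. It is not the actual anvi'o code: the `FILES` registry is a simplified stand-in, the mapping of `cog-20.def.tab` to `COG.txt` and the `'non-essential'` value are assumptions made for illustration, and this `get_essential_file_paths` only mirrors the name seen in the traceback, not its real implementation.

```python
import os
import tempfile

# Simplified, hypothetical stand-in for the file registry quoted above.
FILES = {
    'cog-20.def.tab': {'type': 'essential', 'formatted_file_name': 'COG.txt'},
    'checksum.md5.txt': {'type': 'essential', 'formatted_file_name': 'CHECKSUMS.txt'},
}


def get_essential_file_paths(cog_data_dir, files):
    """Return {raw_name: path} for essential files; raise if one is missing."""
    paths = {}
    for raw_name, info in files.items():
        if info['type'] != 'essential':
            continue  # non-essential files are not required at run time
        path = os.path.join(cog_data_dir, info['formatted_file_name'])
        if not os.path.exists(path):
            raise FileNotFoundError(f"Essential formatted file is missing: {path}")
        paths[raw_name] = path
    return paths


with tempfile.TemporaryDirectory() as cog_data_dir:
    # An existing setup that has COG.txt but no CHECKSUMS.txt.
    open(os.path.join(cog_data_dir, 'COG.txt'), 'w').close()

    # With the checksum file marked essential, the check fails.
    try:
        get_essential_file_paths(cog_data_dir, FILES)
        failed_when_essential = False
    except FileNotFoundError:
        failed_when_essential = True

    # Same directory, but the checksum entry is demoted to non-essential.
    relaxed = {name: dict(info) for name, info in FILES.items()}
    relaxed['checksum.md5.txt']['type'] = 'non-essential'
    relaxed_paths = get_essential_file_paths(cog_data_dir, relaxed)
    passed_when_non_essential = list(relaxed_paths) == ['cog-20.def.tab']
```

In this sketch the same directory fails the check while the checksum entry is essential and passes once it is non-essential, which is why an existing setup keeps working without a re-run.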

We apologize to all the anvio-dev users whose Snakemake workflows and Slurm jobs failed because of this, and we thank everyone for sticking with anvio-dev anyway.

Best wishes,

Ge0rges commented 1 year ago

@meren thank you for handling this, I apologize for not recognizing the difference between the essential and non-essential parameter there. I did not completely realize the function of that keyword. @dspeth sorry for breaking your installation momentarily!

ivagljiva commented 11 months ago

I think this is a solved issue so I will close it :)