merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

anvi-setup-ncbi-cogs issue #1738

Closed meren closed 3 years ago

meren commented 3 years ago

[THE LATEST LATEST UPDATE FOR POSTERITY: THIS IS NOW RESOLVED THANKS TO PRs by @Ge0rges, #2110 and #2112 -- EVERYTHING YOU SEE BELOW IS HISTORY]

Latest update to this issue

We now realize that issues with anvi-setup-ncbi-cogs are related to your internet speed. Faster internet connections result in successful download of the files. Due to the technical setup of the NCBI servers, slow connections that take a very long time to download files are prematurely cut, resulting in broken files :/

Unfortunately there is nothing anvi'o can do about this unless we copy the NCBI resource and host it elsewhere, but I don't think that is an appropriate thing to do. If you have any comments or suggestions, please share below.


The rest of the text in this message is here for historical reasons. Please ignore it.


Why are you here?

Probably anvi'o sent you here so you can help us address this issue.

Summary

This is a problem we have not been able to address, so we decided to collect more data from people to understand this enigmatic error (example, or these: #1686, #1671, #1647) that usually happens around here in the code while running anvi-setup-ncbi-cogs:

Traceback (most recent call last):
  File "/home/australomics/anaconda3/envs/anvio-7/bin/anvi-setup-ncbi-cogs", line 47, in <module>
    setup.create()
  File "/home/australomics/anaconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 617, in create
    self.setup_raw_data()
  File "/home/australomics/anaconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 831, in setup_raw_data
    self.files[file_name]['func'](file_path, J(self.COG_data_dir, self.files[file_name]['formatted_file_name']))
  File "/home/australomics/anaconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 660, in format_p_id_to_cog_id_cPickle
    COG = fields[6]
IndexError: list index out of range
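The traceback above fails at `fields[6]`, which is what happens when a line has fewer than seven fields, as the last line of a truncated file typically does. Below is a minimal sketch, not anvi'o's actual code, of a parser that checks the field count first and names the offending line. The function name, the comma delimiter, and the minimum field count are assumptions inferred from the traceback.

```python
# Assumption: the COG id lives at index 6, as `COG = fields[6]` in the traceback suggests.
EXPECTED_MIN_FIELDS = 7


def extract_cog_id(line, line_number):
    """Return the 7th comma-separated field, or raise a descriptive error.

    Hypothetical helper for illustration only; it is not part of anvi'o.
    """
    fields = line.rstrip("\n").split(",")
    if len(fields) < EXPECTED_MIN_FIELDS:
        # A short line is the typical signature of a download that was cut off.
        raise ValueError(
            f"Line {line_number} has {len(fields)} fields, expected at least "
            f"{EXPECTED_MIN_FIELDS}. The file is probably truncated: {fields}"
        )
    return fields[6]
```

A caller could catch the `ValueError` and suggest re-downloading the file instead of surfacing a bare `IndexError`.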

Do we still need your files?

Yes, please post the full error message, along with the files anvi'o asked you to send in that error message, as a comment below. Please don't forget to mention which operating system you are using and how you installed anvi'o.

Thank you for your patience.

watsonar commented 3 years ago

Hello there. Anvi'o did indeed send me here, with this error message as a gift!

Full command and error message, using anvio-dev:

$ anvi-setup-ncbi-cogs --reset

COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /Users/andrea/github/anvio/anvio/data/misc/COG

WARNING
===============================================
This program will remove everything in the COG data directory, then download and
reformat everything from scratch.

Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
[28 Sep 21 20:43:05 Formatting protein ids to COG ids file] 99.96%                                                                                                     ETA: None

Config Error: Bad news :( While parsing a COG input file, anvi'o encountered an error (which
              said: [list index out of range]) while processing the line 191073 in your file.
              Where the fields in that file looked looked like this: ['GU3_RS11560',
              'GCF_000243075.1', 'WP_014292721.1', '55']. Sadly, this has been a long-standing
              and very annoying issue that anvi'o developers were unable to reproduce. If you
              would like to help us find a solution, please visit the issue located at
              https://github.com/merenlab/anvio/issues/1738. There you can copy-paste this
              error message and attach the file in question that is located on your disk at '/
              Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.co
              g.csv'.

The aforementioned file: cog-20.cog.csv

When I run the same command with anvio-7 I get the exact error message quoted in the summary of this issue. :)

meren commented 3 years ago

Thank you, @watsonar, and wow. This is crazy. This is the line 191,073 where the file ends:

[image: the end of the file at line 191,073]

It is abruptly ending at line 191,073, when my copy of this file has this many lines:

wc -l anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
3,455,853 anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

I think this is an issue related to the NCBI servers, where the download stops for no reason :(

watsonar commented 3 years ago

Thanks for looking into this, and that makes sense! For what it's worth, I ended up having success with this when I ran it on an AWS EC2 instance rather than my local machine. Identical anvi'o setups, same operating system, but download speeds on the EC2 instances are much faster than what I get using my home internet. So I suspect this problem, at least in my case, may have had something to do with internet connection/speed. :/

meren commented 3 years ago

This is great to know, Andrea. Thank you very much!

Ge0rges commented 1 year ago

I seem to run into this issue even on a server's gigabit connection. I was sent here via Discord when searching for this error. Mine says all the files were successfully downloaded; posting for info.

Output of anvi-setup-ncbi-cogs -T 30 --reset:

COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG

WARNING
===============================================
This program will remove everything in the COG data directory, then download and
reformat everything from scratch.

Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz

Config Error: Something went wrong while decompressing the downloaded file :/ It is likely    
              that the download failed and only part of the file was downloaded. If you would 
              like to try again, please run the setup command with the flag `--reset`. Here is
              what the downstream library said: 'Error -3 while decompressing data: invalid   
              code lengths set'. 

Currently on anvio7.1-dev, self test runs fine.

meren commented 1 year ago

It is the same issue. NCBI quietly ends the connection mid-download, and anvi'o thinks the file was downloaded successfully when it wasn't. It is not always about the bandwidth of the recipient. If the NCBI servers are too busy and the download doesn't go as fast as it could, it is also terminated. Try again the next day, and it works. Extremely frustrating.
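Since a silently truncated cog-20.fa.gz still gets reported as "Downloaded successfully", one way to catch it early is to decompress the whole stream before using it. This is a minimal sketch using only the Python standard library. The function name is hypothetical and this is not how anvi'o itself handles the file.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Hypothetical helper for illustration. A truncated stream raises EOFError,
    corrupted data raises zlib.error or gzip.BadGzipFile (an OSError subclass),
    and a missing file raises FileNotFoundError (also an OSError).
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so memory use stays small.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

Note that this only tells you the archive is structurally complete. It does not prove the content matches what NCBI published, which is what a checksum comparison is for.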

Ge0rges commented 1 year ago

Do these files have retrievable hashes that could allow anvi'o to give a clearer status? I understand your frustration!

meren commented 1 year ago

Yes, there is a file at ftp://ftp.ncbi.nih.gov//pub/COG/COG2020/data/checksums.md5.txt that contains the hashes of the files in the directory. It would require a small addition to the COGsSetup class in anvio/cogs.py to let the user know when files did not in fact download successfully.
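To illustrate what such a check could look like, here is a minimal sketch that reads a checksum list and compares one downloaded file against it. It assumes the list uses the usual md5sum layout (hash, whitespace, file name) and that file names contain no spaces. All function names are hypothetical, and this is not the code that was eventually added to anvi'o.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5s(checksum_path):
    """Parse md5sum-style lines into a {file_name: md5} dictionary.

    Assumption: each line is '<md5> <file name>', optionally with a leading
    '*' or directory prefix on the name, as md5sum may produce.
    """
    expected = {}
    for line in Path(checksum_path).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            expected[Path(parts[-1].lstrip("*")).name] = parts[0].lower()
    return expected


def download_is_complete(file_path, checksum_path):
    """Return True only if the file is listed and its MD5 matches the listed one."""
    expected = read_expected_md5s(checksum_path).get(Path(file_path).name)
    return expected is not None and md5_of_file(file_path) == expected
```

With a helper like this, a setup program could report "download incomplete" right after fetching a file rather than failing later while parsing it.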

shanexuuu commented 1 year ago

So glad to try out version 8. I came across the same issue. I tried manually downloading cog-20.cog.csv and checked its md5 before putting it into the db directory. It seems to work now.

#manually download
wget -c https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
wget -c https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/checksums.md5.txt

#check md5
md5sum -c checksums.md5.txt

cp cog-20.cog.csv /usr/bin/miniconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/

#anvio-8
anvi-setup-ncbi-cogs -T 10 --just-do-it
#test
anvi-self-test --suite pangenomics

meren commented 1 year ago

I hope this time it gave you a meaningful error message though :)

ivagljiva commented 1 year ago

We've had a few more people experiencing issues with the COGs download (after the checksum addition), so I wanted to put the manual workaround solution here for people to find. I will also add it to the help page for anvi-setup-ncbi-cogs so that we can easily link to it when helping people with this problem.

Always getting checksum errors? Instructions for manual downloads of the COG data (for COG 2020)

If you have tried re-running anvi-setup-ncbi-cogs but are always getting checksum errors and are about to lose your mind, here is a set of commands that you can follow to manually download the data for the 2020 release of COGs without having to go through the setup program every time.

First, you will need to move to the directory where anvi'o expects to find the COG files. This location will depend on where conda and anvi'o are installed on your computer, but if you have the anvi'o environment loaded in your terminal, you can easily get there by running the following:

cd $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/

The files that anvi'o needs to see in that folder are the following:

checksum.md5.txt cog-20.def.tab   fun-20.tab
cog-20.cog.csv   cog-20.fa.gz

Since you have already tried running anvi-setup-ncbi-cogs so many times, some of those files are probably already in there. But the checksums of those files need to match those that are listed in the checksum.md5.txt file. For instance, if you look for cog-20.cog.csv inside the checksum file:

grep cog-20.cog.csv checksum.md5.txt

You will see the following line: 1bed944a61e0ec404669361fb69ae52d cog-20.cog.csv which indicates that the file's checksum should be exactly 1bed944a61e0ec404669361fb69ae52d. If you run md5sum cog-20.cog.csv, you should see that exact string. If you don't, it means the file was only partially downloaded, so it needs to be downloaded again. You can do it like this:

rm -rf cog-20.cog.csv

curl -O https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv

md5sum cog-20.cog.csv

Once you get a copy of the file with an exactly matching MD5 checksum, you can move on.
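The delete, re-download, and re-check cycle above can also be scripted. Here is a minimal sketch, assuming you already know the expected MD5 string for the file. The function names and the default number of attempts are hypothetical, and network errors are simply left to propagate.

```python
import hashlib
import urllib.request


def download_once(url, destination, chunk_size=1024 * 1024):
    """Stream `url` to `destination` and return the MD5 of the bytes received."""
    digest = hashlib.md5()
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def fetch_until_md5_matches(url, destination, expected_md5, max_attempts=3):
    """Re-download until the MD5 matches; return the attempt number that succeeded.

    Hypothetical helper for illustration. Each attempt overwrites the previous
    (possibly truncated) copy, which mirrors the rm + curl steps above.
    """
    for attempt in range(1, max_attempts + 1):
        if download_once(url, destination) == expected_md5.lower():
            return attempt
    raise RuntimeError(
        f"{destination} still does not match {expected_md5} "
        f"after {max_attempts} attempts"
    )
```

For example, `fetch_until_md5_matches("https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv", "cog-20.cog.csv", "1bed944a61e0ec404669361fb69ae52d")` would repeat the download up to three times. If the NCBI servers are busy, waiting a while between runs may still be necessary.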

You should run md5sum on every file listed above (except for checksum.md5.txt), and check if it matches the corresponding string inside checksum.md5.txt. For any file with a non-matching MD5 checksum, you should download it using curl as we did above:

rm -rf [FILENAME THAT DOES NOT MATCH]
curl -O https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/[FILENAME THAT DOES NOT MATCH]

(make sure you change the file name at the end of the path to match the file that you need)

After you have all the files with matching checksums, you can leave the data folder, and then re-run anvi-setup-ncbi-cogs, which should now work perfectly using the manually downloaded files:

cd
anvi-setup-ncbi-cogs

Ge0rges commented 1 year ago

I would add that using diff is an easy way to compare the output of md5sum with the contents of the checksum file (which one could read line by line, using a simple script to identify the files that need to be redownloaded).
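As a rough illustration of that "simple script" idea, the sketch below takes the text printed by md5sum and the text of the checksum file, and lists the file names whose hashes disagree. It only compares text and does no hashing itself. The function names are hypothetical, and it assumes md5sum-style lines with file names that contain no spaces.

```python
def parse_md5_lines(text):
    """Turn md5sum-style text into a {file_name: md5} dictionary."""
    pairs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # md5sum may prefix the name with '*' in binary mode.
            pairs[parts[-1].lstrip("*")] = parts[0].lower()
    return pairs


def files_needing_redownload(md5sum_output, checksum_text):
    """Return the sorted names of local files whose MD5 differs from the list.

    Files that md5sum reported but that are absent from the checksum list are
    returned as well, since they cannot be verified.
    """
    local = parse_md5_lines(md5sum_output)
    expected = parse_md5_lines(checksum_text)
    return sorted(name for name, md5 in local.items() if expected.get(name) != md5)
```

One could feed it the captured output of `md5sum cog-20.cog.csv cog-20.def.tab fun-20.tab cog-20.fa.gz` together with the contents of the checksum file, then re-download only the names it returns.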

ivagljiva commented 1 year ago

Good idea @Ge0rges , I added that as an option in the anvi-setup-ncbi-cogs documentation, which I will update online soon :)