Closed meren closed 3 years ago
Hello there. Anvi'o did indeed send me here, with this error message as a gift!
Full command and error message, using anvio-dev:
$ anvi-setup-ncbi-cogs --reset
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /Users/andrea/github/anvio/anvio/data/misc/COG
WARNING
===============================================
This program will remove everything in the COG data directory, then download and
reformat everything from scratch.
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
[28 Sep 21 20:43:05 Formatting protein ids to COG ids file] 99.96% ETA: None
Config Error: Bad news :( While parsing a COG input file, anvi'o encountered an error (which
said: [list index out of range]) while processing the line 191073 in your file.
Where the fields in that file looked looked like this: ['GU3_RS11560',
'GCF_000243075.1', 'WP_014292721.1', '55']. Sadly, this has been a long-standing
and very annoying issue that anvi'o developers were unable to reproduce. If you
would like to help us find a solution, please visit the issue located at
https://github.com/merenlab/anvio/issues/1738. There you can copy-paste this
error message and attach the file in question that is located on your disk at '/
Users/andrea/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.co
g.csv'.
The aforementioned file: cog-20.cog.csv
When I run the same command with anvio-7 I get the exact error message quoted in the summary of this issue. :)
Thank you, @watsonar, and wow. This is crazy. Your file ends abruptly at line 191,073, when my copy of this file has this many lines:
wc -l anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
3,455,853 anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
I think this is an issue related to the NCBI servers, where the download stops for no reason :(
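For anyone who wants to check for this kind of cut-off file without counting lines by hand, here is a minimal sketch. It assumes that every line of the CSV has the same number of comma-separated fields as the first line, and that a complete file ends with a newline. Both are assumptions about the file, and `looks_truncated` is a made-up helper name, not something anvi'o provides.

```python
def looks_truncated(csv_path, delimiter=b","):
    """Heuristically report whether a delimited text file was cut off mid-download.

    Assumptions (not anvi'o's own logic): every line has the same number of
    fields as the first line, and a complete file ends with a newline.
    """
    with open(csv_path, "rb") as handle:
        first = handle.readline()
        last = first
        for last in handle:  # walk to the final line without loading the whole file
            pass
    if not first:
        return True  # an empty file is certainly incomplete
    if not last.endswith(b"\n"):
        return True  # the final line stops before its line terminator
    # A final line with fewer (or more) delimiters than the first line is suspicious.
    return last.count(delimiter) != first.count(delimiter)
```

A last line with only four fields, like the one in the error message above, would be flagged as long as the complete rows in the file have more fields than that.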
Thanks for looking into this, and that makes sense! For what it's worth, I ended up having success with this when I ran it on an AWS EC2 instance rather than my local. Identical anvi'o setups, same operating system, but download speeds on the EC2 instances are much faster than what I get using my home internet. So I suspect this problem, at least in my case, may have had something to do with internet connection/speed. :/
This is great to know, Andrea. Thank you very much!
I seem to run into this issue on a server's gigabit connection. I was sent here via Discord when searching for this error. Mine says all the files were downloaded successfully; posting for info.
Output of anvi-setup-ncbi-cogs -T 30 --reset:
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG
WARNING
===============================================
This program will remove everything in the COG data directory, then download and
reformat everything from scratch.
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /Accounts/gkanaan/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
Config Error: Something went wrong while decompressing the downloaded file :/ It is likely
that the download failed and only part of the file was downloaded. If you would
like to try again, please run the setup command with the flag `--reset`. Here is
what the downstream library said: 'Error -3 while decompressing data: invalid
code lengths set'.
Currently on anvio7.1-dev, self test runs fine.
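In case it helps others who hit the decompression error, the following sketch checks whether a downloaded `.gz` file can be read to the end using only Python's standard library. `gzip_is_intact` is a hypothetical helper name, and this is not how anvi'o itself checks the file.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files fit in memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC), EOFError a stream
        # that ends early, zlib.error corrupted compressed data.
        return False
    return True
```

Running `gzip_is_intact("cog-20.fa.gz")` inside the RAW_DATA_FROM_NCBI directory should return `False` for a partially downloaded file.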
It is the same issue. NCBI quietly ends the connection mid-download, and anvi'o thinks the file was downloaded successfully when it wasn't. It is not always about the bandwidth of the recipient. If the NCBI servers are too busy and the download doesn't go as fast as it could, it is also terminated. Try again the next day, and it works. Extremely frustrating.
Do these files have retrievable hashes that could allow anvi'o to give a clearer status? I understand your frustration!
Yes, there is a file at ftp://ftp.ncbi.nih.gov//pub/COG/COG2020/data/checksums.md5.txt that contains the hashes of the files in the directory. It would've required a small addition to the COGsSetup class in anvio/cogs.py to let the user know that files in fact did not download successfully.
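To make the idea concrete, here is a rough sketch of what such a check could look like. It is not the code in anvio/cogs.py; `md5_of_file`, `verify_download`, and `IncompleteDownloadError` are names invented for this illustration, and the expected hash is assumed to come from the checksums file mentioned above.

```python
import hashlib


class IncompleteDownloadError(Exception):
    """Raised when a downloaded file does not match its published MD5 hash."""


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return the observed hash, or raise a clear error if it does not match."""
    observed = md5_of_file(path)
    if observed != expected_md5.strip().lower():
        raise IncompleteDownloadError(
            f"{path}: expected MD5 {expected_md5}, got {observed}. "
            "The download was probably interrupted; please try again."
        )
    return observed
```

With something along these lines, a truncated file would be reported right after the download instead of surfacing later as a parsing or decompression error.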
So glad to try out version 8. Came across the same issue. I tried manually downloading cog-20.cog.csv and checked its md5 before putting it into the db directory. It seems to work now.
#manually download
wget -c https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
wget -c https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/checksums.md5.txt
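#check md5 (--ignore-missing, a GNU md5sum option, skips listed files that were not downloaded)
md5sum -c --ignore-missing checksums.md5.txt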
cp cog-20.cog.csv /usr/bin/miniconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/
#anvio-8
anvi-setup-ncbi-cogs -T 10 --just-do-it
#test
anvi-self-test --suite pangenomics
I hope this time it gave you a meaningful error message though :)
We've had a few more people experiencing issues with the COGs download (after the checksum addition), so I wanted to put the manual workaround solution here for people to find. I will also add it to the help page for anvi-setup-ncbi-cogs so that we can easily link to it when helping people with this problem.
If you have tried re-running anvi-setup-ncbi-cogs but are always getting checksum errors and are about to lose your mind, here is a set of commands that you can follow to manually download the data for the 2020 release of COGs without having to go through the setup program every time.
First, you will need to move to the directory where anvi'o expects to find the COG files. This location will depend on where conda and anvi'o are installed on your computer, but if you have the anvi'o environment loaded in your terminal, you can easily get there by running the following:
cd $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/
The files that anvi'o needs to see in that folder are the following:
checksum.md5.txt cog-20.def.tab fun-20.tab
cog-20.cog.csv cog-20.fa.gz
Since you have already tried running anvi-setup-ncbi-cogs so many times, there are probably some of those files already in there. But the checksums of those files need to match those listed in the checksum.md5.txt file. For instance, if you look for cog-20.cog.csv inside the checksum file:
grep cog-20.cog.csv checksum.md5.txt
You will see the following line:
1bed944a61e0ec404669361fb69ae52d cog-20.cog.csv
which indicates that the file's checksum should exactly match 1bed944a61e0ec404669361fb69ae52d. If you run md5sum cog-20.cog.csv, you should see that exact string. If you don't, it means the file was incompletely downloaded, so it needs to be downloaded again. You can do it like this:
rm -rf cog-20.cog.csv
curl -O https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
md5sum cog-20.cog.csv
Once you get a copy of the file with an exactly matching MD5 checksum, you can move on.
You should run md5sum on every file listed above (except for checksum.md5.txt), and check if it matches the corresponding string inside checksum.md5.txt. For any file with a non-matching MD5 checksum, you should download it again using curl as we did above:
rm -rf [FILENAME THAT DOES NOT MATCH]
curl -O https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/[FILENAME THAT DOES NOT MATCH]
(make sure you change the file name at the end of the path to match the file that you need)
After you have all the files with matching checksums, you can leave the data folder, and then re-run anvi-setup-ncbi-cogs, which should now work perfectly using the manually downloaded files:
cd
anvi-setup-ncbi-cogs
I would add that using diff is an easy way to compare the output of md5sum with the contents of the file (which one could read line by line and use a simple script to identify the files that need to be redownloaded).
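As an illustration of such a script, here is a small Python sketch that reads the checksum file line by line and lists the files that are missing or do not match. It assumes the checksum file uses the usual md5sum layout (hash, whitespace, file name) and sits in the same folder as the data files under the name checksum.md5.txt, as in the file list above. The function names are made up for this example.

```python
import hashlib
from pathlib import Path

# The four data files named earlier in this thread.
COG20_FILES = ("cog-20.cog.csv", "cog-20.def.tab", "fun-20.tab", "cog-20.fa.gz")


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_checksums(checksum_file):
    """Parse md5sum-style lines into a {file name: hash} dictionary."""
    expected = {}
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unexpected lines
            expected[parts[1].lstrip("*")] = parts[0].lower()
    return expected


def files_to_redownload(data_dir, wanted=COG20_FILES, checksum_name="checksum.md5.txt"):
    """Return the wanted files that are missing or whose MD5 does not match."""
    data_dir = Path(data_dir)
    expected = read_checksums(data_dir / checksum_name)
    bad = []
    for name in wanted:
        target = data_dir / name
        if not target.is_file() or md5_of_file(target) != expected.get(name):
            bad.append(name)
    return bad
```

Calling `files_to_redownload(".")` from inside the RAW_DATA_FROM_NCBI folder returns a list of file names; an empty list means every file matched. A file that is not listed in the checksum file at all is also reported, since it cannot be verified.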
Good idea @Ge0rges, I added that as an option in the anvi-setup-ncbi-cogs documentation, which I will update online soon :)
[THE LATEST LATEST UPDATE FOR POSTERITY: THIS IS NOW RESOLVED THANKS TO PRs by @Ge0rges, #2110 and #2112 -- EVERYTHING YOU SEE BELOW IS HISTORY]
Latest update to this issue
We now realize that issues with anvi-setup-ncbi-cogs are related to your internet speed. Faster internet connections result in successful download of the files. Due to the technical setup of the NCBI servers, slow connections that take a very long time to download files are prematurely cut, resulting in broken files :/
Unfortunately there is nothing anvi'o can do about this unless we copy the NCBI resource and host it elsewhere, but I don't think that is an appropriate thing to do. If you have any comments or suggestions, please share below.
The rest of the text in this message is here for historical reasons. Please ignore it.
Why are you here?
Probably anvi'o sent you here so you can help us address this issue.
Summary
This is a problem we have not been able to address, so we decided to collect more data from people to understand this enigmatic error (example, or these: #1686, #1671, #1647) that usually happens around here in the code while running anvi-setup-ncbi-cogs.
Do we still need your files?
Yes, please send the full error message and the files anvi'o requested in its error message as a comment below. Please don't forget to mention which operating system you are using and how you installed anvi'o.
Thank you for your patience.