michaelkyu / PlasX

PlasX, a machine learning classifier for identifying plasmid sequences based on genetic architecture
GNU General Public License v3.0

COG database issue #11

Open pdy1084 opened 2 months ago

pdy1084 commented 2 months ago

Hi PlasX team,

Thank you for providing this insightful software. I managed to run PlasX on my data (a set of open reading frames) with the Pfam database. However, when I try to download and include the COG database and proceed with anvi-setup-ncbi-cogs, anvi-run-ncbi-cogs and anvi-export-functions, I encounter several errors (listed in the section "Terminal output").

Describe the bug

The lines of code marked with -> below give me the errors shown under "Terminal output". If I comment out these lines, I can screen my data against the Pfam_v32 database successfully.

-> anvi-setup-ncbi-cogs --cog-version COG20 --cog-data-dir COG_2020 -T $THREADS --reset
anvi-setup-pfams --pfam-version 32.0 --pfam-data-dir Pfam_v32 -T $THREADS --reset

Annotate COGs
-> anvi-run-ncbi-cogs -T $THREADS --cog-version COG20 --cog-data-dir COG_2020 -c $PREFIX.db

Annotate Pfams
anvi-run-pfams -T $THREADS --pfam-data-dir Pfam_v32 -c $PREFIX.db

Export functions to text file
-> anvi-export-functions --annotation-sources COG20_FUNCTION,Pfam -c $PREFIX.db -o $PREFIX-cogs-and-pfams.txt
anvi-export-functions --annotation-sources Pfam -c $PREFIX.db -o $PREFIX-pfams.txt

I tried changing COG20 to COG14, but it still does not work.

Terminal output

------------------FINISHED GENE CALLING WITH PRODIGAL

Config Error: Something went wrong with your download attempt. Here is the problem for the url ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2014/data/cog2003-2014.csv: '<urlopen error [Errno 113] No route to host>'

Config Error: It seems you already have Pfam database installed in 'Pfam_v32', please use --reset flag if you want to re-download it.

Config Error: At least one essential formatted file that is necesary for COG operations is not where it should be ('.../results/COG_2014/COG14/PID-TO-CID.cPickle'). You should run COG setup, with the flag --reset if necessary, to make sure things are in order.

Config Error: One or more of the annotation sources you requested does not appear to be in the contigs database :/ Here is the list: COG14_FUNCTION.

Software environment

packages in environment at .../.conda/envs/plasx:

Name              Version      Build            Channel
_libgcc_mutex     0.1          main             anaconda
_openmp_mutex     5.1          1_gnu            anaconda
blas              1.0          openblas         anaconda
blosc             1.21.3       h6a678d5_0       anaconda
bottleneck        1.3.7        py312ha883a20_0  anaconda
bzip2             1.0.8        h5eee18b_6       anaconda
ca-certificates   2024.7.2     h06a4308_0       anaconda
expat             2.6.3        h6a678d5_0       anaconda
gawk              5.1.0        h7b6447c_0       anaconda
joblib            1.4.2        py312h06a4308_0  anaconda
ld_impl_linux-64  2.38         h1181459_1       anaconda
libffi            3.4.4        h6a678d5_1       anaconda
libgcc-ng         11.2.0       h1234567_1       anaconda
libgfortran-ng    11.2.0       h00389a5_1       anaconda
libgfortran5      11.2.0       h1234567_1       anaconda
libgomp           11.2.0       h1234567_1       anaconda
libllvm14         14.0.6       hdb19cb5_3       anaconda
libopenblas       0.3.21       h043d6bf_0       anaconda
libstdcxx-ng      11.2.0       h1234567_1       anaconda
libuuid           1.41.5       h5eee18b_0       anaconda
llvm-meta         7.0.0        0                conda-forge
llvmlite          0.43.0       py312h6a678d5_0  anaconda
lz4-c             1.9.4        h6a678d5_1       anaconda
mmseqs2           10.6d92c     h2d02072_0       bioconda
ncurses           6.4          h6a678d5_0       anaconda
numba             0.60.0       py312h526ad5a_0  anaconda
numexpr           2.8.7        py312he7dcb8a_0  anaconda
numpy             1.26.4       py312h2809609_0  anaconda
numpy-base        1.26.4       py312he1a6c75_0  anaconda
openmp            7.0.0        h2d50403_0       conda-forge
openssl           3.0.15       h5eee18b_0       anaconda
pandas            2.2.2        py312h526ad5a_0  anaconda
pip               24.2         py312h06a4308_0  anaconda
plasx             0.0.0        pypi_0           pypi
pybind11-abi      5            hd3eb1b0_0       anaconda
python            3.12.5       h5148396_1       anaconda
python-blosc      1.10.6       py312h526ad5a_0  anaconda
python-dateutil   2.9.0post0   py312h06a4308_2  anaconda
python-tzdata     2023.3       pyhd3eb1b0_0     anaconda
pytz              2024.1       py312h06a4308_0  anaconda
readline          8.2          h5eee18b_0       anaconda
scikit-learn      1.5.1        py312h526ad5a_0  anaconda
scipy             1.13.1       py312h2809609_0  anaconda
setuptools        72.1.0       py312h06a4308_0  anaconda
six               1.16.0       pyhd3eb1b0_1     anaconda
sqlite            3.45.3       h5eee18b_0       anaconda
tbb               2021.8.0     hdb19cb5_0       anaconda
threadpoolctl     3.5.0        py312he106c6f_0  anaconda
tk                8.6.14       h39e8969_0       anaconda
tzdata            2024a        h04d1e81_0       anaconda
wheel             0.44.0       py312h06a4308_0  anaconda
xz                5.4.6        h5eee18b_1       anaconda
zlib              1.2.13       h5eee18b_1       anaconda
zstd              1.5.5        hc292b87_2       anaconda

meren commented 2 months ago

You must be using a very old version of anvi'o for this to happen, @pdy1084. If you don't want to update your anvi'o, then you need to use COG14_FUNCTION instead of COG20_FUNCTION.

Please run anvi-db-info on your contigs database and take a look at the output to figure out which function annotation sources are available to you.

pdy1084 commented 2 months ago

Hi @meren,

Thank you for your fast reply.

I have checked the version of anvi'o and I see that I have the latest version (version 8), as you can see at the beginning of the conda list output:

Then, if I run anvi-db-info on

A) the db resulting from anvi-gen-contigs-database -L 0 -T $THREADS --project-name $PREFIX -f .../results_step0_reformat/cogs/contigs-fasta.fasta -o ${PREFIX}_cogs.db --force-overwrite, I get that there are no available sources.

B) the db resulting from anvi-run-pfams -T $THREADS --pfam-data-dir Pfam_v32 -c $PREFIX.db, I get Pfam (as expected) as an available source.

So after running anvi-db-info over the $PREFIX.db, I get:

AVAILABLE GENE CALLERS

===============================================

AVAILABLE FUNCTIONAL ANNOTATION SOURCES

===============================================

AVAILABLE HMM SOURCES

===============================================


I already tried running this with COG14, as stated on the GitHub page. However, either anvi-setup-ncbi-cogs or anvi-run-ncbi-cogs does not seem to work: when I run anvi-db-info on the db resulting from anvi-run-ncbi-cogs -T $THREADS --cog-version COG14 --cog-data-dir COG_2014 -c ${PREFIX}_cogs.db, I do not get any available sources.

-> I also saw that, when running the same pipeline for Pfam and COG14, in the latter case I do not get the output ${PREFIX}-cogs.txt from anvi-export-functions --annotation-sources COG14_FUNCTION -c ${PREFIX}_cogs.db -o ${PREFIX}-cogs.txt.

Since I managed to generate ${PREFIX}_cogs.db with anvi-run-ncbi-cogs -T $THREADS --cog-version COG14 --cog-data-dir COG_2014 -c ${PREFIX}_cogs.db, I would imagine that the problem surfaces in the command anvi-export-functions --annotation-sources COG14_FUNCTION -c ${PREFIX}_cogs.db -o ${PREFIX}-cogs.txt, but originates more specifically in anvi-setup-ncbi-cogs or anvi-run-ncbi-cogs.

And I still see the following error in the standard output (now 1 instead of 3):

Config Error: Something went wrong with your download attempt. Here is the problem for the url ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2014/data/cog2003-2014.csv: '<urlopen error [Errno 113] No route to host>'

The following also appears:

File ".../.conda/envs/plasx/lib/python3.12/site-packages/plasx/pd_utils.py", line 1044, in read_table
    raise Exception('File {} does not exist'.format(A))
Exception: File gene-catalog-ORFs-cogs.txt does not exist

I hope you can help me to solve this. Thank you very much.
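The PlasX exception above looks like a downstream symptom rather than a separate bug: the export step did not write gene-catalog-ORFs-cogs.txt, yet the pipeline carried on and handed the missing file to PlasX. A minimal sketch of a fail-fast guard for a pipeline script follows. The helper names `run_step` and `require_file` are hypothetical (they are not part of anvi'o or PlasX), and the sketch assumes each anvi'o program exits with a non-zero status when it reports a Config Error; `require_file` acts as a backstop in case it does not.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of anvi'o or PlasX) for a fail-fast pipeline.

# Run one pipeline step; print its label and return non-zero if it fails.
run_step() {
    local label="$1"
    shift
    if ! "$@"; then
        echo "Step failed: ${label}" >&2
        return 1
    fi
}

# Succeed only if the given file exists and is non-empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "Expected output is missing or empty: $1" >&2
        return 1
    fi
}

# Usage inside the pipeline script (commands adapted from this thread):
#   run_step "COG annotation" anvi-run-ncbi-cogs -T "$THREADS" \
#       --cog-version COG14 --cog-data-dir COG_2014 -c "${PREFIX}_cogs.db" || exit 1
#   run_step "export functions" anvi-export-functions \
#       --annotation-sources COG14_FUNCTION -c "${PREFIX}_cogs.db" \
#       -o "${PREFIX}-cogs.txt" || exit 1
#   require_file "${PREFIX}-cogs.txt" || exit 1
```

With guards like these, the run would stop at the first failing anvi'o step instead of surfacing later as a missing-file exception inside PlasX.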

meren commented 2 months ago

Hi @pdy1084,

Your problem lies in the fact that the computer on which you are doing this analysis has no access to nih.gov, as suggested by this message:

Config Error: Something went wrong with your download attempt. Here is the problem for the url
ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2014/data/cog2003-2014.csv: '<urlopen
error [Errno 113] No route to host>'

You need to successfully run anvi-setup-ncbi-cogs for things to move forward. I'm sorry.

meren commented 2 months ago

(perhaps you should talk to your sys admin if you are on your university server)
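Errno 113 is the Linux "No route to host" error, so a quick reachability check on the analysis machine before rerunning anvi-setup-ncbi-cogs can save a failed run. The following is a rough sketch, not an anvi'o feature: `can_reach` is a hypothetical helper, it assumes bash (for the /dev/tcp pseudo-device) and the GNU coreutils `timeout` command, and it probes port 21, the standard FTP control port implied by the ftp:// URL in the error message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port opens
# within the given number of seconds (default 5).
can_reach() {
    local host="$1" port="$2" seconds="${3:-5}"
    # The child bash receives host as $0 and port as $1; a failed
    # redirection makes it exit with a non-zero status.
    timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage on the analysis machine:
#   if can_reach ftp.ncbi.nlm.nih.gov 21; then
#       echo "NCBI FTP control port is reachable."
#   else
#       echo "NCBI FTP is not reachable from this machine." >&2
#   fi
```

A successful probe only shows that the control port answers; FTP data transfers use additional ports, so a firewall could still block the download. A failed probe is a concrete result to bring to the system administrator.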