VFDB contains few genes that are not part of any cluster

PovilasMat commented 1 year ago

Hi,

ariba ran into a weird issue while running on the VF database:

    [E::hts_idx_push] Unsorted positions on sequence # 1: 109 followed by 11
    OSError: building of index for /scratch/shadow/tmpr7wt7j_c/ariba_virulencefinder/ariba_virulencefinder/read_store.gz failed

I figured out that this happens because read_store.gz is sorted incorrectly when one of the genes doesn't have cluster information. I changed read_store.py to sort correctly even with cluster information missing, but then it failed at a later step:

    _init_and_run_clusters
        reference_names=self.cluster_ids[cluster_name],
    KeyError: ''

Obviously, because the cluster name was missing. :)
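
To illustrate why the lookup fails: an empty string is used as a dictionary key that was never registered. Below is a minimal sketch of a guard that turns this into a readable error. Only the name cluster_ids comes from the traceback; the helper reference_names_for and its messages are hypothetical and not part of ariba.

```python
def reference_names_for(cluster_name, cluster_ids):
    """Return the reference names of a cluster, or fail with a clear message.

    Hypothetical helper for illustration only; this is not ariba's code.
    cluster_ids is assumed to map cluster name -> list of reference names.
    """
    if not cluster_name:
        # An empty name means the read's reference was never assigned a cluster.
        raise ValueError(
            "read has no cluster name; its reference is probably "
            "missing from the cluster table"
        )
    if cluster_name not in cluster_ids:
        raise ValueError(f"unknown cluster name: {cluster_name!r}")
    return cluster_ids[cluster_name]
```

A guard like this does not fix the underlying problem (the gene still has no cluster), it only makes the failure easier to diagnose than a bare KeyError: ''.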

Then I started digging around and made this small test:

    mkdir vftest
    cd vftest
    ariba getref virulencefinder out.virulencefinder
    ariba prepareref -f out.virulencefinder.fa -m out.virulencefinder.tsv ./test
    cd test
    cat 02.cdhit.clusters.tsv | awk '{$1="";print}' | tr " " "\n" | sort | uniq > cluster_file
    grep ">" 02.cdhit.all.fa | sed 's/>//g' | sort > all_file
    wc -l all_file
    wc -l cluster_file
    diff cluster_file all_file

Output of the last three lines:

    5558 all_file
    5554 cluster_file    //cluster file contains one empty line in the beginning
    1d0                  //this is the empty line
    <                    //this is the empty line
    718a718
    > csnA_4_KJ922517
    973a974
    > eltIIAB_c8_1_AASRQF010000005
    4943a4945
    > stx2_122_CP022279_122
    5082a5085
    > stx2b_O128_24196_97_95_AJ567995_95
    5157a5161
    > stx2h_O102_STEC299_122_CP022279_122

So the issue arises because one or more of those 5 genes (in my case stx2h_O102_STEC299_122_CP022279_122) can be found in my sequencing reads, but they are not part of any cluster. Whenever read_store is built, the entries for these genes contain no cluster name, which makes the script fail.
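
The same check can be sketched in Python. This assumes, based on the awk command above, that each line of 02.cdhit.clusters.tsv holds a cluster name followed by whitespace-separated member names, and that sequence names contain no whitespace. The function names are made up for this example.

```python
def clustered_names(cluster_lines):
    """Collect every member name listed in the cluster table."""
    names = set()
    for line in cluster_lines:
        fields = line.split()
        # Assumed layout: fields[0] is the cluster name, the rest are members.
        names.update(fields[1:])
    return names


def fasta_names(fasta_lines):
    """Collect sequence names from FASTA header lines (text up to first whitespace)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def unclustered(fasta_lines, cluster_lines):
    """Return the sorted names present in the FASTA but absent from every cluster."""
    return sorted(fasta_names(fasta_lines) - clustered_names(cluster_lines))
```

Calling unclustered() with the open file handles of 02.cdhit.all.fa and 02.cdhit.clusters.tsv should list the same names that diff reports with a leading ">".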

ariba version

    ARIBA version: 2.14.6

    External dependencies:
    bowtie2   2.2.5    /srv/data/tools/anaconda3/envs/env_cge_update/bin/bowtie2
    cdhit     4.8.1    /srv/data/tools/anaconda3/envs/env_cge_update/bin/cd-hit-est
    nucmer    3.1      /srv/data/tools/anaconda3/envs/env_cge_update/bin/nucmer
    spades    3.15.5   /srv/data/tools/anaconda3/envs/env_cge_update/bin/spades.py

    External dependencies OK: True

    Python version:
    3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29) [GCC 10.4.0]

    Python packages:
    ariba      2.14.6   /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/ariba/__init__.py
    bs4        4.11.1   /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/bs4/__init__.py
    dendropy   4.5.2    /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/dendropy/__init__.py
    pyfastaq   3.17.0   /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pyfastaq/__init__.py
    pymummer   0.11.0   /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pymummer/__init__.py
    pysam      0.18.0   /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pysam/__init__.py

    Python packages OK: True

    Everything looks OK: True

etuduri commented 1 year ago

Hi, I have the same issue, please help!!

ARIBA version: 2.14.6

External dependencies:
bowtie2   2.3.4.1   /usr/bin/bowtie2
cdhit     4.7       /usr/bin/cd-hit-est
nucmer    3.1       /usr/bin/nucmer
spades    3.13.0    /home/inei/SPAdes-3.13.0-Linux/bin/spades.py

External dependencies OK: True

Python version: 3.6.9 (default, Mar 10 2023, 16:46:00) [GCC 8.4.0]

Python packages:
ariba      2.14.6     /usr/local/lib/python3.6/dist-packages/ariba/__init__.py
bs4        4.9.2      /home/inei/.local/lib/python3.6/site-packages/bs4/__init__.py
dendropy   4.4.0      /home/inei/.local/lib/python3.6/site-packages/dendropy/__init__.py
pyfastaq   3.17.0     /home/inei/.local/lib/python3.6/site-packages/pyfastaq/__init__.py
pymummer   0.10.3     /home/inei/.local/lib/python3.6/site-packages/pymummer/__init__.py
pysam      0.16.0.1   /home/inei/.local/lib/python3.6/site-packages/pysam/__init__.py

Python packages OK: True

Everything looks OK: True

Thanks in advance !!!

PovilasMat commented 1 year ago

It doesn't seem like ariba will receive any further changes. I have asked the DB maintainers to fix it on their end, but that is still an ongoing process.

sanger-pathogens / ariba

VFDB contains few genes that are not part of any cluster #331