merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvi-interactive crashes when using a collection and a external tree #2167

Closed pedres closed 8 months ago

pedres commented 8 months ago

Short description of the problem

anvi-interactive crashes when run with a contig.db, a profile.db, a collection of external bins, and an external tree.

anvi'o version

anvi-self-test --version

Anvi'o .......................................: hope (v7.1)

Profile database .............................: 38
Contigs database .............................: 20
Pan database .................................: 15
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 2

System info

Ubuntu 22. Anvi'o installed in a conda environment.

Detailed description of the issue

I created a contig.db from a contig.fa file, then mapped the reads to the contigs and made a profile.db. Finally, I imported a collection of external bins and generated the concatenated amino acid FASTA file and the tree file. The problem appears when I run anvi-interactive, which fails with an error. Running anvi-interactive without the external bins does not work either, but in that case the error seems to come from the absence of any hierarchical clustering. I did not pass the --cluster-contigs flag to anvi-profile because this is one sample from a group of three, and I will do that after merging the three profiles. Below is what anvi'o printed when I ran anvi-interactive:

Contigs DB ...................................: Initialized: ss1-CONTIGS.db (v. 20)
Interactive mode .............................: collection

WARNING

ProfileSuperClass found a collection focus, which means it will be initialized using only the splits in the profile database that are affiliated with the collection MAGS and all bins it describes.

Auxiliary Data ...............................: Found: ss1/AUXILIARY-DATA.db (v. 2)
Profile Super ................................: Initialized with 20209 of 435975 splits: ss1/PROFILE.db (v. 38)

THE MORE YOU KNOW ?

Someone asked the Contigs Superclass to initialize only a subset of contig sequences. Usually this is a good thing and means that some good code somewhere is looking after you. Just FYI, this class will only know about 17,938 contig sequences instead of all the things in the database.

Additional Tree ..............................: Splits will be organized based on 'MAGS-tree:unknown:unknown'.

Traceback (most recent call last):
  File "/home/fulgencio/miniconda3/envs/anvio-7.1/bin/anvi-interactive", line 122, in <module>
    d = interactive.Interactive(args)
  File "/home/fulgencio/miniconda3/envs/anvio-7.1/lib/python3.6/site-packages/anvio/interactive.py", line 254, in __init__
    self.load_collection_mode()
  File "/home/fulgencio/miniconda3/envs/anvio-7.1/lib/python3.6/site-packages/anvio/interactive.py", line 1121, in load_collection_mode
    self.p_meta['default_item_order'] = get_default_item_order_name(default_clustering_class, self.p_meta['available_item_orders'])
  File "/home/fulgencio/miniconda3/envs/anvio-7.1/lib/python3.6/site-packages/anvio/dbops.py", line 4945, in get_default_item_order_name
    default_item_order = list(item_orders_dict.keys())[0]
AttributeError: 'list' object has no attribute 'keys'
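The last frame of this traceback can be reproduced in isolation. The sketch below is a hypothetical stand-in and not anvi'o's code: first_item_order_name is an invented helper that mirrors only the failing expression list(item_orders_dict.keys())[0], and the dictionary contents are made up for illustration. It shows that the expression works on a dict and raises the same AttributeError on a list, which is what self.p_meta['available_item_orders'] appears to hold in this run.

```python
def first_item_order_name(item_orders):
    """Hypothetical helper mirroring the failing expression in the traceback."""
    return list(item_orders.keys())[0]


# A dict keyed by item order name works as the expression expects.
# The value stored under the key is made up for illustration.
as_dict = {"MAGS-tree:unknown:unknown": {"type": "newick"}}
print(first_item_order_name(as_dict))

# A plain list of names has no .keys() method, which reproduces the error.
as_list = ["MAGS-tree:unknown:unknown"]
try:
    first_item_order_name(as_list)
except AttributeError as error:
    message = str(error)
    print(message)
```

If this reading is right, the crash is a type mismatch between what the caller passes and what the function expects, rather than a problem with the tree file itself.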

Files / commands to reproduce the issue

bowtie2-build ss1.fa ss1
bowtie2 --threads 24 -x ss1 -1 shot_pathog/ss1_R1.fastq.gz \
        -2 shot_pathog/ss1_R2.fastq.gz \
        --no-unal \
        -S ss1.sam
samtools view -@ 24 -F 4 -bS ss1.sam > ss1-RAW.bam
samtools sort -@ 24 ss1-RAW.bam -o ss1.bam
samtools index -@ 24 ss1.bam
rm ss1.sam ss1-RAW.bam

anvi-profile -c ss1-CONTIGS.db \
             -S ss1 \
             -i ss1.bam \
             --profile-SCVs \
             --num-threads 16 \
             -S ss1

anvi-import-collection -c ss1-CONTIGS.db \
                       -p ss1/ss1-PROFILE.db \
                       -C MAGS --contigs-mode MAGS.txt

anvi-get-sequences-for-hmm-hits -c ss1-CONTIGS.db \
                                -p ss1/PROFILE.db \
                                -o mags_amino.fa \
                                -C MAGS \
                                --hmm-source Bacteria_71 \
                                --gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
                                --return-best-hit \
                                --get-aa-sequences \
                                --concatenate

anvi-gen-phylogenomic-tree -f mags_amino.fa \
                           -o MAGS-tree.txt

anvi-interactive -p ss1/PROFILE.db -c ss1-CONTIGS.db -C MAGS -t MAGS-tree.txt

I am uploading the files to Drive. I will edit this post with the link when the upload finishes.

meren commented 8 months ago

Could it be that some of the MAGs in your collection didn't have any of these genes and were excluded from MAGS-tree.txt?

pedres commented 8 months ago

That was the first error I hit after importing the collection and running all the commands: a large number of MAGs had none of the requested genes, so anvi-get-sequences-for-hmm-hits did not include them in its output. I fixed it by importing only 67 MAGs, then extracting the amino acid FASTA and building the tree. I then checked that the collection and the tree contain the same MAGs and ran anvi-interactive.

In addition, I am using anvio to combine MAGs obtained independently from several types of samples (three replicates per sample type). The MAGs were produced by shotgun + Hi-C + PhaseGenomics blackbox deconvolution for each sample. I have a contig file, MAGs, and shotgun reads for each sample (biological replicate).

My approach would be to join the three contig files (joined_contig.fa) and make a contig.db from them, from which I will get single-copy core genes. Next I will map the shotgun reads of each biological replicate to an index built from joined_contig.fa and create three profiles, which will then be merged. Once they are merged I will import the MAGs as a collection. Finally I would follow the section of the Tara tutorial on “Combining MAGs from...”

Since the MAGs have the same names across biological replicates (bin_1... bin_N), I renamed them by adding the sample name (sample1_bin_1). I did the same with the contig names (k141_1 to sample1_k141_1) to avoid conflicts when importing the MAG collection.

Is this a good approach? Or would it be better and easier to treat every sample independently, creating its own contig.db, profile.db, and collection of MAGs, and then join them after refining the MAGs following “Combining MAGs from...”?

Thanks a lot for your help and advice.

Manuel
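The renaming scheme described above can be sketched as follows. This is only an illustration of the idea, not a tested workflow: prefix_rows is a hypothetical helper, and the contig and bin names are made up. It prefixes both names with the sample name and then confirms that contig names are unique once the replicates are combined.

```python
def prefix_rows(sample, rows):
    """Prefix contig and bin names with the sample name (k141_1 -> sample1_k141_1)."""
    return [(f"{sample}_{contig}", f"{sample}_{bin_name}") for contig, bin_name in rows]


# Made-up (contig, bin) pairs; both replicates reuse the same raw names.
replicates = {
    "sample1": [("k141_1", "bin_1"), ("k141_2", "bin_1")],
    "sample2": [("k141_1", "bin_1"), ("k141_7", "bin_2")],
}

merged = []
for sample, rows in replicates.items():
    merged.extend(prefix_rows(sample, rows))

contigs = [contig for contig, _ in merged]
# Every contig name must be unique after the replicates are combined.
assert len(contigs) == len(set(contigs))

# Print one tab-separated contig/bin pair per line.
for contig, bin_name in merged:
    print(f"{contig}\t{bin_name}")
```

Without the prefix, k141_1 and bin_1 would each refer to two different things in the merged data, which is the conflict the renaming is meant to avoid.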

pedres commented 8 months ago

https://drive.google.com/drive/folders/1BBc4IaeJZpfmaoKVWk7jWbs-k4ilHN0_?usp=sharing files

meren commented 8 months ago

Hey @pedres, can you please update your anvi'o to v8 and try again? I just realized you're still on v7.1. We're unable to support earlier versions of anvi'o as we don't have the human resources for that.

meren commented 8 months ago

If you run into the same error, I will then carefully go through your files and try to help you -- thank you very much for your patience! :)

pedres commented 8 months ago

Ok,

Thanks a lot. I will try it on Monday on another computer with anvio-8. In fact, I need to update my home computer, since I already have anvio 8 installed in the lab and at the computing facility.

Regards,

Manuel



pedres commented 8 months ago

Hi, I have just tested this with anvio-8 (installed with mamba in its own environment, following the instructions on the website), and it still fails. Below are the commands used and the error message. I also tried running anvi-interactive in manual mode, which also gives an error:

anvi-interactive -p test.db -f mags_amino.fa -t MAGS-tree.txt --manual-mode

Config Error: Some of the names in your view data does not have corresponding entries in the FASTA file you provided. Here is an example to one of those 64 names that occur in your data file, but not in the FASTA file: "bin_61"

However, grep -o "bin_64" mags_amino.fa | wc -l returns 1, and so does grep -o "bin_64" MAGS-tree.txt | wc -l. Curiously, if I run the anvi-interactive command again it gives the same error but names a different bin, for example bin_23. Again, that bin is present in both MAGS-tree.txt and mags_amino.fa. I have attached these two files because the problem seems to lie there. Thanks again for your help.
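A substring search like the grep calls above can succeed even when the names do not match exactly. The sketch below illustrates the difference under one assumption that this thread suggests but does not confirm: that the name compared against the tree is the full defline text rather than its first word. The defline is a shortened, made-up example in the format anvi-get-sequences-for-hmm-hits reported in this thread.

```python
# Made-up defline in the format reported in this thread (gene list shortened).
defline = ">bin_64 num_genes:6|genes:Ribosomal_L1,Ribosomal_L2|separator:XXX"
wanted = "bin_64"

# Substring search, similar in spirit to grep -o "bin_64": finds the name.
substring_hit = wanted in defline

# Exact comparison against everything after '>': fails because of the extra fields.
full_name = defline[1:]
exact_hit = full_name == wanted

# Comparing only the first whitespace-delimited token succeeds.
first_token = full_name.split()[0]
token_hit = first_token == wanted

print(substring_hit, exact_hit, token_hit)
```

So a grep count of 1 shows only that the string occurs somewhere in the file, not that the sequence name equals the bin name.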

https://drive.google.com/drive/folders/1-jJCDlBsGDSriKWfMNf0LQXTmhXQ73lq?usp=sharing

anvi-migrate --migrate-safely ss1-CONTIGS.db

anvi-import-collection -c ss1-CONTIGS.db \
                       -p ss1/PROFILE.db \
                       -C MAGS --contigs-mode MAGS.txt

64 bins in the collection

anvi-get-sequences-for-hmm-hits -c ss1-CONTIGS.db \
                                -p ss1/PROFILE.db \
                                -o mags_amino.fa \
                                -C MAGS \
                                --hmm-source Bacteria_71 \
                                --gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
                                --return-best-hit \
                                --get-aa-sequences \
                                --concatenate

grep -o ">" *.fa | wc -l ### to check that there were 64 bins in the aminoacid file

anvi-gen-phylogenomic-tree -f mags_amino.fa -o MAGS-tree.txt

anvi-interactive -p ss1/PROFILE.db -c ss1-CONTIGS.db -C MAGS -t MAGS-tree.txt

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

Traceback (most recent call last):
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/bin/anvi-interactive", line 122, in <module>
    d = interactive.Interactive(args)
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/lib/python3.10/site-packages/anvio/interactive.py", line 211, in __init__
    self.completeness = Completeness(self.contigs_db_path)
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/lib/python3.10/site-packages/anvio/completeness.py", line 45, in __init__
    self.SCG_domain_predictor = scgdomainclassifier.Predict(argparse.Namespace(), run=terminal.Run(verbose=False), progress=self.progress)
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/lib/python3.10/site-packages/anvio/scgdomainclassifier.py", line 234, in __init__
    SCGDomainClassifier.__init__(self, args, run, progress)
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/lib/python3.10/site-packages/anvio/scgdomainclassifier.py", line 73, in __init__
    self.rf.initialize_classifier()
  File "/media/fulgencio/DATOS/conda/envs/anvio-8/lib/python3.10/site-packages/anvio/learning.py", line 103, in initialize_classifier
    classifier_obj = pickle.load(open(self.classifier_object_path, 'rb'))
  File "sklearn/tree/_tree.pyx", line 728, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1432, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:

meren commented 8 months ago

Hey @pedres,

Sorry for the not-so-helpful error messages here. I think all your downstream issues are due to a very simple problem: the deflines in your FASTA file look like this (because anvi'o reported them as such, it is not your fault):

>bin_34 num_genes:6|genes:Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6|separator:XXX

But every other program in anvi'o wants the deflines in your FASTA file to look like this so bin names can be connected to the individual sequences and so on:

>bin_34

When I remove the excess information from the FASTA file with this command,

sed -i '' 's/ .*$//g' mags_amino.fa

then the next command runs without any issue:

anvi-interactive -p test.db -f mags_amino.fa -t MAGS-tree.txt --manual-mode

I think the rest of the commands you've been trying to run will also work if you use this new FASTA file.
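One portability note on the sed command above: the -i '' form is BSD/macOS sed syntax, whereas GNU sed (the default on Ubuntu) expects -i with no separate empty argument. As a platform-independent alternative, here is a minimal Python sketch of the same cleanup. The function name strip_defline_metadata is hypothetical, and unlike the sed expression it only edits lines that start with '>', on the assumption that sequence lines contain no spaces.

```python
def strip_defline_metadata(fasta_text):
    """Keep only the text before the first space on each defline."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Drop everything from the first space onward, as the sed call does.
            line = line.split(" ", 1)[0]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Made-up two-record example in the defline format shown in this thread.
example = (
    ">bin_34 num_genes:6|genes:Ribosomal_L1,Ribosomal_L2|separator:XXX\n"
    "MKVLAAGXXXMSRT\n"
    ">bin_35 num_genes:6|genes:Ribosomal_L1,Ribosomal_L2|separator:XXX\n"
    "MKILSAGXXXMTRS\n"
)
result = strip_defline_metadata(example)
print(result, end="")
```

To apply it to a real file, read mags_amino.fa into a string, pass it through the function, and write the result to a new file, so that the original stays available for comparison.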

Best wishes, Meren