merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Error when running anvi-merge-collections #1395

Open MelanieCHay opened 4 years ago

MelanieCHay commented 4 years ago

Hello,

Below are the results of anvi-self-test --version

Anvi'o version ...............................: esther (v6.1)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1


I am comparing binning methods on a large number of contigs. I'd like to add the collections as a layer and then do some 'consensus-binning' in anvi-interactive before refining.

I have exported my collections and now have txt files of contigs and bin names.

E.g.

c_000000000001_split_00001	Bin_24
c_000000000003_split_00001	Bin_24
c_000000000006_split_00001	Bin_24
c_000000000009_split_00001	Bin_24
c_000000000029_split_00001	Bin_24
c_000000000078_split_00001	Bin_24
c_000000000082_split_00001	Bin_24
c_000000000087_split_00001	Bin_24
c_000000000091_split_00001	Bin_24
c_000000000095_split_00001	Bin_24

So looks good.

I have done this with concoct, metabat2, maxbin2, and dastool.

I then tried to merge the collections using:

anvi-script-merge-collections -c CONTIGS.db \
                              -i additional-files/external-binning-results/*.txt \
                              -o collections.tsv

But I get an error. It looks like this.

~/data/sval-anvio$ anvi-script-merge-collections -c 03-contigs/sval_mg_contigs.db -i COLL-concoct.txt COLL-maxbin2.txt COLL-metabat2.txt COLL-dastool.txt -o binning_collections.tsv
New Source ...................................: COLL-concoct, w/ 163249 contigs
New Source ...................................: COLL-maxbin2, w/ 163249 contigs
New Source ...................................: COLL-metabat2, w/ 163249 contigs
New Source ...................................: COLL-dastool, w/ 163249 contigs
Final number of unique contigs ...............: 163,249
Contigs DB ...................................: Initialized: 03-contigs/sval_mg_contigs.db (v. 14)

Config Error: Oh. You have the wrong stuff. Probably. Because, the contig 'c_000000061291_split_00001' does not match to any of the contig names in your database. Here is a random contig name you have in it in comparison: 'c_000000000001'.

MelanieCHay commented 4 years ago

Back again, looks like there might be a mismatch between contig names and split names?

Is this an easy fix? I've looked for a contig-mode or split-mode option, like the one available for 'anvi-import-collection'. I am also wondering whether to just hack this by editing the tab-delimited file from the bin export, removing the split info and keeping the contig info.
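The hack described above could be sketched roughly as follows. This is only an illustration, assuming the exported files have two tab-delimited columns (split name, bin name) and that split names end in `_split_` followed by digits, as in the example earlier in this thread. The function names are hypothetical and not part of anvi'o.

```python
import re

# Assumption: split names look like '<contig name>_split_<digits>'.
SPLIT_SUFFIX = re.compile(r"_split_\d+$")


def split_to_contig(name):
    """Return the contig name for a split name (unchanged if there is no suffix)."""
    return SPLIT_SUFFIX.sub("", name)


def convert_lines(lines):
    """Turn 'split<TAB>bin' lines into unique 'contig<TAB>bin' lines."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pair = (split_to_contig(fields[0]), fields[1])
        if pair not in seen:  # several splits of one contig collapse to one row
            seen.add(pair)
            yield "\t".join(pair)


def convert_file(src_path, dst_path):
    """Write a contig-level copy of an exported collection file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for out in convert_lines(src):
            dst.write(out + "\n")
```

Calling `convert_file` once per exported file (for example on `COLL-concoct.txt`, writing to a new file name of your choice) would give contig-level files to try as input for the merge script.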

meren commented 4 years ago

You did everything right, and that's why you're running into this issue :) Many apologies for this very confusing situation.

Usually people use contig names in files to merge with anvi-script-merge-collections, and the requirement of a contigs database here is to find out the translation of those contig names to split names (so an anvi'o additional data table can be generated with split names).

You already have split names, so all is golden, but anvi'o is treating them as contig names to find out the corresponding split names in the database for each one of them.

To solve this, we need a --splits-mode flag for the anvi-script-merge-collections script.

I will take a quick look and write back, @MelanieCHay.

jimen210 commented 7 months ago

Hi, were you able to deal with this issue? I think I am facing a similar problem. Thank you

ivagljiva commented 7 months ago

I don't think we ended up implementing a --splits-mode for this, @jimen210. If possible, I would suggest using contig names for the script input (by simply removing the _split_xxxxx part from the split names). The only reason I can imagine for not using the contig names is if some of your contigs are split across different bins (i.e., one split from the contig is in one bin and another split from the same contig is in a different bin), but that usually means the binning was done very poorly, and I'm not sure it's worth keeping bins like that.
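Before dropping the _split_xxxxx part, it may be worth checking whether any contig actually has splits in more than one bin. A minimal sketch of such a check is below, assuming (split name, bin name) pairs read from one exported collection file and split names ending in `_split_` plus digits. The function name is hypothetical, not an anvi'o API.

```python
import re
from collections import defaultdict

# Assumption: split names look like '<contig name>_split_<digits>'.
SPLIT_SUFFIX = re.compile(r"_split_\d+$")


def contigs_in_multiple_bins(pairs):
    """Map each contig whose splits land in more than one bin to its sorted bin names."""
    bins_by_contig = defaultdict(set)
    for split_name, bin_name in pairs:
        contig = SPLIT_SUFFIX.sub("", split_name)
        bins_by_contig[contig].add(bin_name)
    return {
        contig: sorted(bins)
        for contig, bins in bins_by_contig.items()
        if len(bins) > 1
    }
```

An empty result would suggest that contig names can be used safely for that collection. Any contig listed in the result would need a decision about which bin to keep it in.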