merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
443 stars 145 forks

[BUG] The number of splits in a bin containing all splits does not equal the number of items from the interactive interface #2373

Open mschecht opened 1 week ago

mschecht commented 1 week ago

Short description of the problem

The number of splits in a bin containing all splits in the interactive interface does not equal the number of items of misc-data.

anvi'o version

$ anvi-self-test --version
Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Profile database .............................: 40
Contigs database .............................: 24
Pan database .................................: 21
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

macOS Sonoma 14.6.1

Detailed description of the issue

Hi anvi'o community! In my analyses, I use bins to group items that have NA misc-data with their surrounding splits to pass along information. Unfortunately, I began noticing discrepancies between the number of items in the misc-data and the total number of splits in a collection of all splits, i.e., when I export a collection containing all splits from a profile-db, its split count does not equal the number of items in the misc-data.

Here is an example with a bin with everything: image

It has 8,641 splits: image

You can reproduce the above here:

cd TEST/

anvi-interactive

and load the collection: IQtree_test_all_bin (it's a big interface and may take a second to load)

However, the number of items in the misc-data does not match the number of splits in the bin:

anvi-export-misc-data -p PROFILE.db --target-data-table items -o items.txt

$ wc -l  items.txt
8683 items.txt

8683 lines including the header, i.e. 8682 items without the header
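Because `wc -l` also counts the header line, it is easy to be off by one here. A minimal Python sketch for counting only the data rows is shown below; it assumes the exported file is TAB-delimited with exactly one header line, and the helper name `count_items` is hypothetical (it is not part of anvi'o).

```python
import csv
import io


def count_items(handle) -> int:
    """Count data rows in a TAB-delimited file, skipping one header line.

    Assumption: the first line is a header and every following
    non-empty line describes exactly one item.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line, if any
    return sum(1 for row in reader if row)


# Toy stand-in for an exported items file (column names are made up).
demo = io.StringIO("item_name\tsome_layer\nsplit_a\t1\nsplit_b\t2\n")
n_items = count_items(demo)
print(n_items)  # 2
```

With the real export, the same function could be called as `count_items(open("items.txt", newline=""))`.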

Furthermore, the tree in the interface has the same number of leaves as the items misc-data:

library("ape")

> read.tree("Ribosomal_L14-AA_subset_remove_long_seqs_aligned_maxiters_2_trimmed_filtered_IQTREE_ultrafast_bootstrap.contree")

Phylogenetic tree with 8682 tips and 8678 internal nodes.

Tip labels:
  TARA_SAMEA4397472_METAG_Ribosomal_L14_000000000033, TARA_SAMEA4397472_METAG_Ribosomal_L14_000000000095, TARA_SAMEA2623059_METAG_Ribosomal_L14_000000000001, TARA_SAMEA4397930_METAG_Ribosomal_L14_000000000016, TARA_SAMEA2620970_METAG_Ribosomal_L14_000000000095, BGEO_SAMN07136678_METAG_Ribosomal_L14_000000000018, ...
Node labels:
  , 100, 89, 56, 14, 30, ...

Unrooted; includes branch lengths.

8682 tree tips

I started chatting with @metehaansever about this issue last week but here is the formal documentation of the bug. Thanks in advance for the help and support.

Files / commands to reproduce the issue

https://uchicago.box.com/s/ggg4xso3qxrvdjphsyx006ay1uuzvjcd

FlorianTrigodet commented 1 week ago

Not a bug from the interface, but there is something wrong with the tree called Rooted final A. That tree contains fewer items than are present in the contigs.db.

# export the bad tree and a good one
$ anvi-export-items-order -p PROFILE.db -o Rooted_final_A.txt --name Rooted_final_A
$ anvi-export-items-order -p PROFILE.db -o IQTree.txt --name IQTree

# count the number of items (every item name contains 'split')
$ grep -o "split" Rooted_final_A.txt | wc -l
8641
$ grep -o "split" IQTree.txt | wc -l
8682
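Counting occurrences of "split" shows how many leaves differ, but not which ones. The sketch below lists the leaf names present in one newick tree and absent from the other. It is a rough illustration, not an anvi'o API: the function names are hypothetical, and the regular expression assumes unquoted leaf names without spaces in a tree that contains at least one pair of parentheses.

```python
import re

# A leaf name directly follows "(" or ","; internal node labels follow ")"
# and branch lengths follow ":", so neither is captured.
LEAF_RE = re.compile(r"[(,]\s*([^():,;\s]+)")


def leaf_names(newick: str) -> set[str]:
    """Return the set of leaf names found in a newick string."""
    return set(LEAF_RE.findall(newick))


def missing_leaves(reference_newick: str, other_newick: str) -> list[str]:
    """Leaves present in the reference tree but absent from the other tree."""
    return sorted(leaf_names(reference_newick) - leaf_names(other_newick))


# Toy trees standing in for the exported items orders.
good = "((a_split_00001:0.1,b_split_00001:0.2)100:0.3,c_split_00001:0.4);"
bad = "(a_split_00001:0.1,c_split_00001:0.4);"
print(missing_leaves(good, bad))  # ['b_split_00001']
```

Assuming both exported files hold plain newick text, the same call could be applied to the contents of `IQTree.txt` (as the reference) and `Rooted_final_A.txt` to name the leaves that were lost.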

How it is possible to have a tree with fewer leaves than items in a profile.db, I have no idea. If you try to re-import the bad tree, anvi'o complains:

$ anvi-import-items-order -p PROFILE.db -i Rooted_final_A.txt --name toto
Target database ..............................: PROFILE.db
Database type ................................: profile

Order file path ..............................: Rooted_final_A.txt
Order data type ..............................: newick
Order name ...................................: toto

Config Error: Ehem. There is something wrong with the incoming items order data here :/
              Basically, the names found in your input data do not match to the item names
              found in the database. For example, this item
              "BATS_SAMN08390924_METAG_Ribosomal_L14_000000000108_split_00001" is in your
              database, but not in your input data

mschecht commented 1 week ago

Thanks for diving in @FlorianTrigodet! I made a reproducible example with the files attached above. Step 3 is exactly where the number of items in the EVERYTHING bin changes from 8,682 → 8,641. It has to do with rotating the tree.

Step 1:

Change the items order to IQTree: [2 images]

correct number of leaves

Step 2:

Root here:

[3 images]

correct number of leaves

Step 3:

Rotate here:

[3 images]

Mysteriously, 41 leaves disappear (8,682 → 8,641) :(