merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
427 stars 145 forks source link

[BUG] `--min-alignment-fraction` in anvi-dereplicate-genomes not working as expected #1684

Open michaelkyu opened 3 years ago

michaelkyu commented 3 years ago

Short description of the problem

The --min-alignment-fraction parameter of anvi-dereplicate-genomes is not working as expected.

anvi'o version

I'm using the master version as of this post (February 25, 2021)

Anvi'o .......................................: hope (v7-dev)

Profile database .............................: 35 Contigs database .............................: 20 Pan database .................................: 14 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 2 tRNA-seq database ............................: 1

System info

Installed anvi'o through conda. Installed on CentOS v7

Detailed description of the issue

If my understanding is correct, the --min-alignment-fraction parameter of anvi-dereplicate-genomes should set the similarity between two sequences to 0 if the coverage of EITHER sequence is below the specified value. This doesn't seem to be working (or my understanding is incorrect).

To demonstrate this issue, I created three RANDOM sequence fragments

Fragment 1: (800 nt) CTCCACCTGCGTCTATCGTACAGTGCTAAAATGGCAGCAGATAGTGAAACTTCCGCTAAGCTAGCCCCTCAGGGTACACTGCACCGAGGCGTGCTGTCAATACTTGATTAAGTCGGGTTGTCGGGGACCTGCCGTCACGCTTCCGAGTGTATATCCGGATTTAGTTGACGTCATACAGAGGCACTAAGAAGAATAAACGCTTACCTCCAGCAATCGTGTAGTGTCAGGTGTACGTTCTCCCTTGCGCCGTTCGGCAAGCGTCCGGTGTCGGGCTGCAAGGAATAAATCTTTATGGACCAGAGGGGCTTGTTTCCTCATATGGGTGCGTGCACTTATACGATTCAAAGGTGGATATGGCCGCATAACACGTAGCCAGGCTATAGTCCCGCGGCCTAATTCCTTCGAGTGCGGGTGCCTGTTTTTGTTTTTCCTTTACGACACGAACCGCTCTAACCTGCTCTATTTCGCCACGTTCCAGTGAACCTCTTAGCCTACCGCCCACGTACGGTGGGACGCGTCGAGCAGTTAAGGTACTGTGGAGAAATCGTTCAATATTAGAAAACAGGCGGTGTACGAATTACTGTGTCCCGGTGTTGCCCGTTTAACGGCTGCCGTGGTCATACCGTGAGGCACCACGAGGGGATGCTACGCAACATGCGAGGTGTAATCAGCAGGGAAGATCCCGGGGATCGAAAGCGGTCCGCGATTTGCGACCGATATGCATAAGGTGTCATTTATATTACACCTACGACACTGGTACCGGCTCACAGCCAAATGCACAGTCTCAAGATAGAATTCGC

Fragment 2: (100 nt) AAAAATTAGGCTTATCGGGCGCTTACTCTTTGTTACACTTCTGGTCTGTGAGTGACGCCCTGTGTCCCATCACTGCATGTGAGGATGCGTGTACTGCACC

Fragment 3: (100 nt) TGTGAATTACAGACGTTTCCCTACCAGGGCGCTACGTTATAATGTTCGGTTGCAACCCTCTATAGGGTGATCTCGACATACCACTATGGTTTGCGCGTCT

Next, I concatenated the fragments to create two sequences: seqA (fragment 1 + 2) and seqB (fragment 2+3)

By construction, the sequences have a perfect match of 800nt (fragment 1), so their similarity is 100%, before taking coverage into consideration. The coverage of each sequence is 800/900 = 88.9%.

I dereplicated with this command (see dropbox link for files):

anvi-dereplicate-genomes --debug -f input.txt -o out --program fastANI --fragment-length 100 --min-alignment-fraction 0.95 --similarity-threshold 0.95 -T 250 --skip-checking-genome-hashes --log-file out.log --representative-method length

Because --min-alignment-fraction is set to 95%, I expect the similarity to be set to 0, so that both sequences should be kept are dereplicating. Instead, the sequences are collapsed to a single cluster. The output files are

SIMILARITY_SCORES/fastANI_alignment_fraction.txt:

key     seqA    seqB
seqA    1.0     0.8888888888888888
seqB    0.8888888888888888      1.0

SIMILARITY_SCORES/fastANI_ani.txt:

key     seqA    seqB
seqA    0.999959        1.0
seqB    1.0     0.999917

It seems that anvi'o correctly calculated that the alignment coverage is 88.9%, BUT it did not set the similarity to 0.

Files to reproduce

This zip file contains both the input and output files.

https://www.dropbox.com/s/s3ndnihp0gaw4nt/anvio_dereplicate_genomes_bug_2-25-21.zip?dl=0

meren commented 3 years ago

@mschecht, I thought we had fixed this because you had identified something similar. Am I confused here?

mschecht commented 3 years ago

@meren I recently made a PR to add filter functionality for anvi-run-hmms by utilizing target and query coverages but I have worked on any part of the code base that has to do with ANI.

ekiefl commented 3 years ago

Sorry

Hey @michaelkyu, sorry about this. It seems that that parameter is a pyANI only parameter:

image

What to do

Since its a pyANI parameter, this explains why it had no effect on your dataset.

Ideally those filter heuristics would be applied regardless of which driver is selected. The way things are coded, that parameter has complex behavior with other parameters, such as --significant-alignment-length (seen here). This complicates something that shouldn't be complicated.

Basically, solving this problem properly would require a decent pull request that consolidates filtering behavior between fastANI and pyANI.

At the very least I want to raise an error if someone tries to set incompatible program parameters, that way no one wastes their time with this again. However, I can't think of a way to determine whether a value coming from an argparse namespace is set as the default value, or whether it was explicitly set by the user :\

The workaround

I would suggest running fastANI with the program anvi-compute-genome-similarity. Then, you could go into the output folder, and manually set any values you deem fit to 0. Afterwards, you could run anvi-dereplicate-genomes using the flag --ani-dir, where you supply the output directory generated by anvi-compute-genome-similarity (that you had edited some of the values for manually).

Other

There appears to be a fastANI parameter called --min-fraction that was introduced in this commit: dd651318e4. The description goes:

  --min-fraction MIN_FRACTION
                        Minimum fraction of alignment to be shared between
                        genome pairs to calculate ANI. If reference and query
                        genome size differ, smaller one among the two is
                        considered. The default is 0.25.

As I recall, this got added to fastANI. I wonder if it does the same thing? Tagging @meren in case he remembers.

meren commented 3 years ago

Thanks for clarifying this, @ekiefl.

As I recall, this got added to fastANI. I wonder if it does the same thing? Tagging @meren in case he remembers.

My commit was only to update the parameter as it seemed to have changed from one version to the next in pyANI :/ I haven't tested if the behavior, thinking that it must have remained the same.

michaelkyu commented 3 years ago

Thanks @ekiefl for the explanation and workaround!