Open michaelkyu opened 3 years ago
@mschecht, I thought we had fixed this because you had identified something similar. Am I confused here?
@meren I recently made a PR to add filter functionality for anvi-run-hmms
by utilizing target and query coverages but I have worked on any part of the code base that has to do with ANI.
Hey @michaelkyu, sorry about this. It seems that that parameter is a pyANI
only parameter:
Since its a pyANI parameter, this explains why it had no effect on your dataset.
Ideally those filter heuristics would be applied regardless of which driver is selected. The way things are coded, that parameter has complex behavior with other parameters, such as --significant-alignment-length
(seen here). This complicates something that shouldn't be complicated.
Basically, solving this problem properly would require a decent pull request that consolidates filtering behavior between fastANI and pyANI.
At the very least I want to raise an error if someone tries to set incompatible program parameters, that way no one wastes their time with this again. However, I can't think of a way to determine whether a value coming from an argparse
namespace is set as the default value, or whether it was explicitly set by the user :\
I would suggest running fastANI with the program anvi-compute-genome-similarity
. Then, you could go into the output folder, and manually set any values you deem fit to 0. Afterwards, you could run anvi-dereplicate-genomes
using the flag --ani-dir
, where you supply the output directory generated by anvi-compute-genome-similarity
(that you had edited some of the values for manually).
There appears to be a fastANI
parameter called --min-fraction
that was introduced in this commit: dd651318e4. The description goes:
--min-fraction MIN_FRACTION
Minimum fraction of alignment to be shared between
genome pairs to calculate ANI. If reference and query
genome size differ, smaller one among the two is
considered. The default is 0.25.
As I recall, this got added to fastANI. I wonder if it does the same thing? Tagging @meren in case he remembers.
Thanks for clarifying this, @ekiefl.
As I recall, this got added to fastANI. I wonder if it does the same thing? Tagging @meren in case he remembers.
My commit was only to update the parameter as it seemed to have changed from one version to the next in pyANI :/ I haven't tested if the behavior, thinking that it must have remained the same.
Thanks @ekiefl for the explanation and workaround!
Short description of the problem
The
--min-alignment-fraction
parameter ofanvi-dereplicate-genomes
is not working as expected.anvi'o version
I'm using the master version as of this post (February 25, 2021)
Anvi'o .......................................: hope (v7-dev)
Profile database .............................: 35 Contigs database .............................: 20 Pan database .................................: 14 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 2 tRNA-seq database ............................: 1
System info
Installed anvi'o through conda. Installed on CentOS v7
Detailed description of the issue
If my understanding is correct, the
--min-alignment-fraction
parameter ofanvi-dereplicate-genomes
should set the similarity between two sequences to 0 if the coverage of EITHER sequence is below the specified value. This doesn't seem to be working (or my understanding is incorrect).To demonstrate this issue, I created three RANDOM sequence fragments
Fragment 1: (800 nt)
CTCCACCTGCGTCTATCGTACAGTGCTAAAATGGCAGCAGATAGTGAAACTTCCGCTAAGCTAGCCCCTCAGGGTACACTGCACCGAGGCGTGCTGTCAATACTTGATTAAGTCGGGTTGTCGGGGACCTGCCGTCACGCTTCCGAGTGTATATCCGGATTTAGTTGACGTCATACAGAGGCACTAAGAAGAATAAACGCTTACCTCCAGCAATCGTGTAGTGTCAGGTGTACGTTCTCCCTTGCGCCGTTCGGCAAGCGTCCGGTGTCGGGCTGCAAGGAATAAATCTTTATGGACCAGAGGGGCTTGTTTCCTCATATGGGTGCGTGCACTTATACGATTCAAAGGTGGATATGGCCGCATAACACGTAGCCAGGCTATAGTCCCGCGGCCTAATTCCTTCGAGTGCGGGTGCCTGTTTTTGTTTTTCCTTTACGACACGAACCGCTCTAACCTGCTCTATTTCGCCACGTTCCAGTGAACCTCTTAGCCTACCGCCCACGTACGGTGGGACGCGTCGAGCAGTTAAGGTACTGTGGAGAAATCGTTCAATATTAGAAAACAGGCGGTGTACGAATTACTGTGTCCCGGTGTTGCCCGTTTAACGGCTGCCGTGGTCATACCGTGAGGCACCACGAGGGGATGCTACGCAACATGCGAGGTGTAATCAGCAGGGAAGATCCCGGGGATCGAAAGCGGTCCGCGATTTGCGACCGATATGCATAAGGTGTCATTTATATTACACCTACGACACTGGTACCGGCTCACAGCCAAATGCACAGTCTCAAGATAGAATTCGC
Fragment 2: (100 nt)
AAAAATTAGGCTTATCGGGCGCTTACTCTTTGTTACACTTCTGGTCTGTGAGTGACGCCCTGTGTCCCATCACTGCATGTGAGGATGCGTGTACTGCACC
Fragment 3: (100 nt)
TGTGAATTACAGACGTTTCCCTACCAGGGCGCTACGTTATAATGTTCGGTTGCAACCCTCTATAGGGTGATCTCGACATACCACTATGGTTTGCGCGTCT
Next, I concatenated the fragments to create two sequences: seqA (fragment 1 + 2) and seqB (fragment 2+3)
By construction, the sequences have a perfect match of 800nt (fragment 1), so their similarity is 100%, before taking coverage into consideration. The coverage of each sequence is 800/900 = 88.9%.
I dereplicated with this command (see dropbox link for files):
anvi-dereplicate-genomes --debug -f input.txt -o out --program fastANI --fragment-length 100 --min-alignment-fraction 0.95 --similarity-threshold 0.95 -T 250 --skip-checking-genome-hashes --log-file out.log --representative-method length
Because
--min-alignment-fraction
is set to 95%, I expect the similarity to be set to 0, so that both sequences should be kept are dereplicating. Instead, the sequences are collapsed to a single cluster. The output files areSIMILARITY_SCORES/fastANI_alignment_fraction.txt
:SIMILARITY_SCORES/fastANI_ani.txt
:It seems that anvi'o correctly calculated that the alignment coverage is 88.9%, BUT it did not set the similarity to 0.
Files to reproduce
This zip file contains both the input and output files.
https://www.dropbox.com/s/s3ndnihp0gaw4nt/anvio_dereplicate_genomes_bug_2-25-21.zip?dl=0