Open apcamargo opened 2 months ago
You're completely correct -- the problem here is that nucmer isn't finding any matches, and since we calculate containment from coverage of the matches, here containment distance is 1. Unfortunately there's not much we can do to fix issues stemming from nucmer bugs.
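To illustrate why an empty match set gives a containment distance of 1, here is a minimal sketch. It assumes containment distance is one minus the larger of the two covered fractions; the function names and the exact formula are illustrative assumptions, not Pling's actual code.

```python
def covered_length(intervals):
    """Count positions covered by 1-based, inclusive (start, end) intervals."""
    total = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully inside a region that was already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total


def containment_distance(ref_matches, qry_matches, ref_len, qry_len):
    """Assumed formula: 1 - max(covered fraction of reference, of query)."""
    ref_cov = covered_length(ref_matches) / ref_len
    qry_cov = covered_length(qry_matches) / qry_len
    return 1.0 - max(ref_cov, qry_cov)
```

With no matches at all, both covered fractions are 0, so the distance is 1 regardless of how similar the sequences really are.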
nucmer usually does work, we've tested quite a lot of circularisation
Worth checking nucmer options and if tweaking them helps?
Something odd because here's nucmer with default settings, no delta-filter...
1 vs 4:
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [FRM] [TAGS]
1 4116 1 4116 4116 4116 99.22 4116 4116 1 1 genome_1 genome_4 [IDENTITY]
1 4116 4116 1 4116 4116 99.22 4116 4116 1 -1 genome_1 genome_4 [IDENTITY]
1 vs 8:
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [FRM] [TAGS]
1 1416 1416 1 1416 1416 97.75 4116 4116 1 -1 genome_1 genome_8 [BEGIN]
1 2700 1417 4116 2700 2700 100.00 4116 4116 1 1 genome_1 genome_8 [BEGIN]
1417 4116 4116 1417 2700 2700 98.82 4116 4116 1 -1 genome_1 genome_8 [END]
2701 4116 1 1416 1416 1416 100.00 4116 4116 1 1 genome_1 genome_8 [END]
Also watch out, pling calculates distance not similarity
Oh, I'm aware. When I was referring to "similarity" I meant 1 - distance. I also included the contents of the output file with the distances.
ah sorry, yes you did
No worries! Let me know if there's anything I can do to help figuring out what is going on here
oh man I think I might know what it is. We use delta-filter with the -1 flag, which is meant to choose the best scored match if there's two matches on the same interval (either on reference or query). In my previous experience, if they were scored the same, both matches would be reported after delta-filter, and I think it says that that is what it's supposed to do somewhere in the documentation (but I'd have to go hunting to check). For some reason here though, delta-filter has discarded both matches instead of reporting both.
If I run Pling without the -1 flag though, everything works fine, we get correct containment distances and integerisation.
Since we have implemented overlap fixing and are able to handle duplicate matches, we can in principle remove the -1 and technically not break anything*. However, it may change results for genomes with duplicate regions, including the integerisation and DCJ-Indel distance. Ding deals with duplicate markers by basically figuring out how to pair up the duplicates (between genomes) such that the DCJ-Indel distance is minimal, and anything that is left over is treated as an indel. It doesn't account for sequence similarity between markers when doing this. When we use the -1 flag, we kind of pair up duplicates in advance of doing the DCJ-Indel calculation, which a) improves Ding runtime b) gives us more "realistic" results. With regards to b), I have some toy examples with duplicate genes, where Pling from annotation and Pling from alignment give different results because the duplicate regions get matched up differently when you account for best matches in the alignment approach. Does that make any sense??
TLDR removing the flag would fix this specific edge case, but may produce worse results more generally. I'm not sure there are any better options though.
*Looking at what it says about the -1 flag, it might actually break our projection between coordinates:
"-1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping"
More generally, here's the description of what delta-filter -1 outputs that I was able to find:
-1 1-to-1 alignment allowing for rearrangements
(intersection of -r and -q alignments)
-q Maps each position of each query to its best hit in
the reference, allowing for reference overlaps
-r Maps each position of each reference to its best hit
in the query, allowing for query overlaps
An important distinction between the -g option and the -1 and -m
options is that -g requires the alignments to be mutually consistent
in their order, while the -1 and -m options are not required to be
mutually consistent and therefore tolerate translocations,
inversions, etc. In general cases, the -m option is the best choice,
however -1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping.
I guess another option is not using delta-filter to filter out best matches, but do so ourselves in the match preprocessing. Will still probably change some results though. And might make things slower, I guess?
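A rough sketch of what doing the best-match filtering ourselves could look like, keeping every tied best match instead of dropping them. It assumes matches are grouped only when their intervals are exactly identical, and that percent identity is the score; the `Match` tuple and all function names are hypothetical and not part of Pling or MUMmer.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one alignment row (coordinates as in show-coords).
Match = namedtuple("Match", "ref_start ref_end qry_start qry_end identity")


def ref_interval(m):
    return (m.ref_start, m.ref_end)


def qry_interval(m):
    # Reverse-strand matches list the query coordinates in descending order.
    return tuple(sorted((m.qry_start, m.qry_end)))


def best_per_interval(matches, key):
    """For each identical interval keep all matches sharing the top identity."""
    groups = defaultdict(list)
    for m in matches:
        groups[key(m)].append(m)
    kept = set()
    for group in groups.values():
        top = max(m.identity for m in group)
        kept.update(m for m in group if m.identity == top)
    return kept


def filter_keep_ties(matches):
    """Intersection of the per-reference and per-query best matches."""
    best_ref = best_per_interval(matches, ref_interval)
    best_qry = best_per_interval(matches, qry_interval)
    return [m for m in matches if m in best_ref and m in best_qry]
```

On the coordinates posted above, this keeps both 99.22% matches for 1 vs 4 and only the two 100% matches for 1 vs 8. Whether it changes integerisation or runtime on real data is untested here.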
Do you think there are other cases where -1 will discard all equally good alignments? This case might be an extreme example, but it could show something that can happen in more "standard" sequences (e.g., large tandem duplications). I find it particularly weird that this happens between genome_1 and genome_4, but not between genome_4 and genome_8.
Just my 2 cents, though. I'm sure you guys did a lot of testing and know better.
I honestly don't fully understand what is happening here. The results for delta-filter -r are
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
1 4116 | 4116 1 | 4116 4116 | 99.22 | genome_4 genome_1
and for delta-filter -q are
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
1 4116 | 4116 1 | 4116 4116 | 99.22 | genome_4 genome_1
so they're both reporting the same match. delta-filter -1 is supposed to be the intersection of matches from delta-filter -r and delta-filter -q, but somehow when you run delta-filter -1 it reports no matches. Maybe when you run directly with -1, the "best" match that gets chosen for reference and query are different, hence there's nothing in the intersection, which is kind of the best explanation I can think of.
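The hypothesis above can be shown with a toy model. This is only a sketch of the suspected mechanism: it assumes each side keeps exactly one winner among tied matches and that the two sides break the tie in opposite order. That tie-breaking behaviour is a guess for illustration and is not taken from delta-filter's source.

```python
# Two tied matches on the same interval, differing only in strand.
FWD = ("1-4116", "+", 99.22)
REV = ("1-4116", "-", 99.22)


def best_first(matches):
    """Single winner; max() returns the first of several tied items."""
    return max(matches, key=lambda m: m[2])


def best_last(matches):
    """Single winner with the opposite tie-break (last tied item wins)."""
    return max(reversed(matches), key=lambda m: m[2])


def best_all(matches):
    """Every match that shares the top score."""
    top = max(m[2] for m in matches)
    return {m for m in matches if m[2] == top}


def one_winner_intersection(matches):
    # Reference side and query side each pick a different tied match.
    return {best_first(matches)} & {best_last(matches)}


def keep_ties_intersection(matches):
    # Both sides keep all tied matches, so the intersection is not empty.
    return best_all(matches) & best_all(matches)
```

Here `one_winner_intersection([FWD, REV])` is empty, while `keep_ties_intersection([FWD, REV])` contains both matches, which is the behaviour expected from earlier experience with tied scores.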
I think the reason this is able to happen is because it's a palindrome -- both matches are mapping on the exact same region, just different strands. Usually you'll have say one region on the reference, that matches with two distinct (but still possibly overlapping) regions, in which case one of the matches is chosen as "best" for the reference, but both are kept for the query (because it's a match on two different intervals wrt the query). Then in the intersection, one of the two matches is still reported, so it's kind of okay. Basically tandem duplications won't cause an issue.
And honestly, when it comes to duplicates, most of the time you get two (or more) "distinct" matches that both get reported, that we then deal with in our overlap fixing. For example, in a case of duplication you might see something like this:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
1 500 | 500 1 | 500 500 | 100 | ref query
2 503 | 600 1100 | 501 500 | 98 | ref query
That's obviously a duplication, but delta-filter
will treat those matches as unrelated because none of the intervals on reference or query are exactly the same.
The other way (tandem) duplications are often reported in practice is as two overlapping matches, I actually have a figure for this:
You can see in a) how nucmer would match up the regions on those two genomes -- the duplicated region ends up covered by one match from the right, and a second from the left.
Anyway, as for why this isn't an issue for genomes 1 and 8 -- there's a true "best" match for the duplicate matches, i.e. the ones with 100% identity. The -1 flag is correctly discarding the other matches, so we're left with a consistent set of matches in the intersection of -r and -q.
Could use delta-filter -qr? That works on 1 vs 4
That seems like a reasonable option that probably won't break everything, will chase up
Okay, I've got a follow up on -qr vs -1. Here's the DCJ-Indel distances plotted against each other for a test set of 128 IncY plasmids:
They mostly stay the same, but not entirely. It's mostly for higher distances that there's any difference, which makes sense since the alignments get increasingly complicated with larger DCJ distances. Looks like for lower distances there's no change, which means cluster results won't change for the majority of cases. So from the perspective of just the clustering, it should in principle be safe to change the flags.
While evaluating how Pling! deals with circularly permuted sequences, I came across a case where it computes the containment similarity of two sequences that are 99.22% similar as 0%. In the example attached, genome_1 and genome_4 differ in a single line of the FASTA file, but Pling computes their containment similarity as 0. Curiously, genome_8, which is just a circularly permuted version of genome_1, gets a containment similarity of 1 to genome_4. Given that genome_1 and genome_8 are identical, it makes no sense that their similarity to genome_4 is different.
A particularity of this example is that genome_1 is a palindrome. I don't really know if this is what is causing the issue, though.
I noticed that the <prefix>.1delta output of dnadiff is empty when comparing these genomes. Could this be the reason?
example.zip