Pling! incorrectly computes containment similarity as 0 for highly similar sequences

apcamargo commented 2 months ago

While evaluating how Pling! deals with circularly permuted sequences, I came across a case where it computes the containment similarity of two sequences that are 99.22% similar as 0%. In the example attached, genome_1 and genome_4 differ in a single line of the FASTA file, but Pling computes their containment similarity as 0. Curiously, genome_8 which is just a circularly permuted version of genome_1 gets a containment similarity of 1 to genome_4. Given that genome_1 and genome_8 are identical, it makes no sense that their similarity to genome_4 is different.

plasmid_1       plasmid_2       distance
genome_4        genome_1        1.0
genome_8        genome_1        0.0
genome_8        genome_4        0.0

A particularity of this example is that genome_1 is a palindrome. I don't really know if this is what is causing the issue, though.

I noticed that the <prefix>.1delta output of dnadiff is empty when comparing these genomes. Could this be the reason?

example.zip

babayagaofficial commented 2 months ago

You're completely correct -- the problem here is that nucmer isn't finding any matches, and since we calculate containment from coverage of the matches, here containment distance is 1. Unfortunately there's not much we can do to fix issues stemming from nucmer bugs.

iqbal-lab commented 2 months ago

nucmer usually does work, we've tested quite a lot of circularsation

martinghunt commented 2 months ago

Worth checking nucmer options and if tweaking them helps?

martinghunt commented 2 months ago

Something odd because here's nucmer with default settings, no delta-filter...

1 vs 4:

[S1]    [E1]    [S2]    [E2]    [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [FRM]   [TAGS]
1   4116    1   4116    4116    4116    99.22   4116    4116    1   1   genome_1    genome_4    [IDENTITY]
1   4116    4116    1   4116    4116    99.22   4116    4116    1   -1  genome_1    genome_4    [IDENTITY]

1 vs 8:

[S1]    [E1]    [S2]    [E2]    [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [FRM]   [TAGS]
1   1416    1416    1   1416    1416    97.75   4116    4116    1   -1  genome_1    genome_8    [BEGIN]
1   2700    1417    4116    2700    2700    100.00  4116    4116    1   1   genome_1    genome_8    [BEGIN]
1417    4116    4116    1417    2700    2700    98.82   4116    4116    1   -1  genome_1    genome_8    [END]
2701    4116    1   1416    1416    1416    100.00  4116    4116    1   1   genome_1    genome_8    [END]

iqbal-lab commented 2 months ago

Also watch out, pling calculates distance not similarity

apcamargo commented 2 months ago

Oh, I'm aware. When I was referring to "similarity" I meant 1 - distance. I also included the contents of the output file with the distances.

iqbal-lab commented 2 months ago

ah sorry, yes you did

apcamargo commented 2 months ago

No worries! Let me know if there's anything I can do to help figuring out what is going on here

babayagaofficial commented 2 months ago

oh man I think I might know what it is. We use delta-filter with the -1 flag, which is meant to choose the best scored match if there's two matches on the same interval (either on reference or query). In my previous experience, if they were scored the same, both matches would be reported after delta-filter, and I think it says that that is what it's supposed to do somewhere in the documentation (but I'd have to go hunting to check). For some reason here though, delta-filter has discarded both matches instead of reporting both. If I run Pling without the -1 flag though, everything works fine, we get correct containment distances and integerisation.

Since we have implemented overlap fixing and are able to handle duplicate matches, we can in principle remove the -1 and technically not break anything*. However, it may change results for genomes with duplicate regions, including the integerisation and DCJ-Indel distance. Ding deals with duplicate markers by basically figuring out how to pair up the duplicates (between genomes) such that the DCJ-Indel distance is minimal, and anything that is left over is treated as an indel. It doesn't account for sequence similarity between markers when doing this. When we use the -1 flag, we kind of pair up duplicates in advance of doing the DCJ-Indel calculation, which a) improves Ding runtime b) gives us more "realistic" results. With regards to b), I have some toy examples with duplicate genes, where Pling from annotation and Pling from alignment give different results because the duplicate regions get matched up differently when you account for best matches in the alignment approach. Does that make any sense?? TLDR removing the flag would fix this specific edge case, but may produce worse results more generally. I'm not sure there's any better options though.

*Looking at what it says about the -1 flag, it might actually break our projection between coordinates: "-1 can be handy for applications such as SNP finding which require a 1-to-1 mapping" More generally, here's the description of what delta-filter -1 outputs that I was able to find:

-1            1-to-1 alignment allowing for rearrangements
              (intersection of -r and -q alignments)
-q            Maps each position of each query to its best hit in
              the reference, allowing for reference overlaps
-r            Maps each position of each reference to its best hit
              in the query, allowing for query overlaps
An important distinction between the -g option and the -1 and -m
options is that -g requires the alignments to be mutually consistent
in their order, while the -1 and -m options are not required to be
mutually consistent and therefore tolerate translocations,
inversions, etc. In general cases, the -m option is the best choice,
however -1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping.

babayagaofficial commented 2 months ago

I guess another option is not using delta-filter to filter out best matches, but do so ourselves in the match preprocessing. Will still probably change some results though. And might make things slower, I guess?

apcamargo commented 1 month ago

Do you think there are other cases where -1 will discard all equally good alignments? This case might be an extreme example, but it could show something that can happen in more "standard" sequences (e.g., large tandem duplications). I find it particularly weird that this happens between genome_1 and genome_4, but not between genome_4 and genome_8.

Just my 2 cents, though. I'm sure you guys did a lot of testing and know better.

babayagaofficial commented 1 month ago

I honestly don't fully understand what is happening here. The results for delta-filter -r are

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=====================================================================================
       1     4116  |     4116        1  |     4116     4116  |    99.22  | genome_4     genome_1

and for delta-filter -q are

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=====================================================================================
       1     4116  |     4116        1  |     4116     4116  |    99.22  | genome_4     genome_1

so they're both reporting the same match. delta-filter -1 is supposed to be the intersection of matches from delta-filter -r and delta-filter -q, but somehow when you run delta-filter -1 it reports no matches. Maybe when you run directly with -1, the "best" match that gets chosen for reference and query are different, hence there's nothing in the intersection, which is kind of the best explanation I can think of. I think the reason this is able to happen is because it's a palindrome -- both matches are mapping on the exact same region, just different strands. Usually you'll have say one region on the reference, that matches with two distinct (but still possibly overlapping) regions, in which case one of the matches is chosen as "best" for the reference, but both are kept for the query (because it's a match on two different intervals wrt the query). Then in the intersection, one of the two matches is still reported, so it's kind of okay. Basically tandem duplications won't cause an issue. And honestly, when it comes to duplicates, most of the time you get two (or more) "distinct" matches that both get reported, that we then deal with in our overlap fixing. For example, in a case of duplication you might see something like this:

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=====================================================================================
       1     500  |     500        1  |     500     500  |    100  | ref   query
       2     503  |     600        1100  |     501     500  |    98  | ref   query

That's obviously a duplication, but delta-filter will treat those matches as unrelated because none of the intervals on reference or query are exactly the same. The other way (tandem) duplications are often reported in practice is as two overlapping matches, I actually have a figure for this: overlap_example You can see in a) how nucmer would match up the regions on those two genomes -- the duplicated region ends up covered by one match from the right, and a second from the left. Anyway, as for why this isn't an issue for genomes 1 and 8 -- there's a true "best" match for the duplicate matches, i.e. the ones with 100% identity. The -1 flag is correctly discarding the other matches, so we're left with a consistent set of matches in the intersection of -r and -q.

martinghunt commented 1 month ago

Could use delta-filter -qr? That works on 1 vs 4

babayagaofficial commented 1 month ago

That seems like a reasonable option that probably won't break everything, will chase up

babayagaofficial commented 2 weeks ago

Okay, I've got a follow up on -qr vs -1. Here's the DCJ-Indel distances plotted against each other for a test set of 128 IncY plasmids: Figure_1

They mostly stay the same, but not entirely. It's mostly for higher distances that there's any difference, which makes sense since the alignments get increasingly complicated with larger DCJ distances. Looks like for lower distances there's no change, which means cluster results won't change for the majority of cases. So from the perspective of just the clustering, it should in principle be safe to change the flags.

iqbal-lab-org / pling

Pling! incorrectly computes containment similarity as 0 for highly similar sequences #57