iqbal-lab-org / pling

Plasmid analysis using rearrangement distances
MIT License

Integerise from mdelta file rather than 1delta file #10

Closed babayagaofficial closed 9 months ago

babayagaofficial commented 1 year ago

The 1delta file selects a single match even when there are several, so using it can miss duplicates.

The mdelta file reports all matches.

babayagaofficial commented 10 months ago

so the difference between 1-to-1 alignment and M-to-M alignment in nucmer is apparently as follows:

All the alignment intervals on the reference and query are scored based on some combination of their length and identity, and a subset are marked as the "best" alignment for that region of the genome. -m keeps any alignment that is marked best on EITHER the reference OR query. -1 keeps only alignments that are marked best on BOTH the reference AND query. Another way of saying it: if there is a duplication in the query genome, -m will attempt to keep alignments from the ancestral copy in the reference to both paralogs in query. -1 will only keep the alignments between orthologs, and the duplication will remain unaligned. Both options are heuristics and don't always work perfectly, but can be helpful in reducing the number of repetitive alignments in the output.
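To make the EITHER/OR versus BOTH/AND rule concrete, here is a toy sketch in Python. It is not nucmer's or delta-filter's actual algorithm: the `Alignment` record, the region labels standing in for genomic intervals, and the single `score` field are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record: region labels stand in for genomic intervals,
# and `score` stands in for the length/identity score mentioned above.
Alignment = namedtuple("Alignment", ["ref_region", "query_region", "score"])


def best_by(alignments, key):
    """Return the alignments holding the top score for their region under `key`."""
    top = {}
    for aln in alignments:
        region = getattr(aln, key)
        top[region] = max(top.get(region, aln.score), aln.score)
    return {aln for aln in alignments if aln.score == top[getattr(aln, key)]}


def filter_many_to_many(alignments):
    """Keep alignments marked best on EITHER the reference OR the query."""
    best_ref = best_by(alignments, "ref_region")
    best_query = best_by(alignments, "query_region")
    return [a for a in alignments if a in best_ref or a in best_query]


def filter_one_to_one(alignments):
    """Keep alignments marked best on BOTH the reference AND the query."""
    best_ref = best_by(alignments, "ref_region")
    best_query = best_by(alignments, "query_region")
    return [a for a in alignments if a in best_ref and a in best_query]


# Block B in the reference aligns to B and to a weaker copy B' in the query.
alns = [Alignment("B", "B", 100), Alignment("B", "B'", 90)]
many = filter_many_to_many(alns)  # both alignments survive
one = filter_one_to_one(alns)     # only the B-to-B alignment survives
```

In this toy model, two alignments with equal scores are both "best" on both sides, so the 1-to-1 filter keeps both of them.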

In light of this, it seems reasonable to keep using the 1delta file, since ultimately Ding matches duplicates 1-to-1 and treats the rest as indels. That said, 1-to-1 and M-to-M won't necessarily give the same DCJ distance. Consider, e.g.:

genome 1: A B C
genome 2: A B B' C C'

1-to-1 would give DCJ = 2, while M-to-M would give DCJ = 1. Based on nucleotide similarity, it seems more reasonable to infer that these were two separate duplications, so even though the 1-to-1 distance is greater, it is based on "better" prior information.

It looks like when there are two "best" matches (i.e. they have the same score), 1-to-1 outputs both of them, so we can handle that ambiguous case ourselves. This makes duplicate matches a special case of overlaps (#9).
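A minimal sketch of how tied matches could be detected as overlaps, assuming each match is a tuple `(ref_start, ref_end, query_start, query_end)` with 1-based inclusive coordinates. The tuple layout and the function name are hypothetical, not pling's actual data model.

```python
from itertools import combinations


def overlapping_ref_pairs(matches):
    """Return pairs of matches whose reference intervals overlap.

    Two tied "best" matches for the same reference block show up here
    as a pair with identical (fully overlapping) reference intervals.
    """
    pairs = []
    for a, b in combinations(matches, 2):
        # Closed intervals [a0, a1] and [b0, b1] overlap iff a0 <= b1 and b0 <= a1.
        if a[0] <= b[1] and b[0] <= a[1]:
            pairs.append((a, b))
    return pairs


matches = [
    (1, 100, 1, 100),      # block A, unique match
    (101, 200, 101, 200),  # block B matched to the first copy
    (101, 200, 201, 300),  # block B matched to a tied second copy
]
ambiguous = overlapping_ref_pairs(matches)
```

Here `ambiguous` holds the single pair of block B matches, which is the case left for us to resolve.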

TL;DR: we don't need to use the mdelta file, as that just makes extra work for us and we lose helpful information for matching duplicates.

iqbal-lab commented 10 months ago

Why would M-to-M give DCJ = 1? Reopening so that it is a bit irritating and you remember to tell me on Monday. I am a bit confused.

iqbal-lab commented 10 months ago

I mean, I only want to be irritating because it is memorable (not claiming unusual)

iqbal-lab commented 10 months ago

I could have just said can we talk on Monday