Open rsharris opened 7 months ago
Hi Bob, could you share you example?
wfmash first runs a similarity search (with MashMap3) to find similar regions between the input sequences, and then apply end-to-end alignment (hierarchical wavefront alignment) only to those.
Sent from Outlook for Androidhttps://aka.ms/AAb9ysg
From: Bob Harris @.> Sent: Saturday, April 6, 2024 10:26:36 AM To: waveygang/wfmash @.> Cc: Subscribed @.***> Subject: [waveygang/wfmash] Is linear synteny assumed? (Issue #233)
I am new to wfmash (but of course have plenty of alignment experience).
I tried an experiment in which I generated a random genome and a second with rearrangements (including random reversals) of items from the first with various lengths (from 1K to 10K) and divergences between 0 and 20%. No indels (other than the rearrangements). There are no duplications ― and every bit of each genome is homologous to a unique bit of the other.
I tried aligning using wfmash apple.fa orange.fa --map-pct-id=70 and no alignments at all were reported. Total genome length was ≈400K.
I should note that I built wfmash from a clone of just a couple days ago (v0.13.0-3-gc18520b), and that I have run a few other experiments mapping similarly diverged items, but as separate reads, to the random genome, and I do get alignments in that case (though not as many as I might expect).
I notice in issue #161https://github.com/waveygang/wfmash/issues/161 an Oct/24/2022 post that shows a dot plot of a whole genome alignment that looks mostly syntenic. Which makes me wonder if wfmash assumes end-to-end synteny.
More generally I'm trying to understand how I need to parameterize wfmash to find all the homologies in my sequences, and to understand what its limitations are w.r.t. divergence and lengths.
― Reply to this email directly, view it on GitHubhttps://github.com/waveygang/wfmash/issues/233, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHQCKBAAK33G436JYZTY4AAZZAVCNFSM6AAAAABF2PSZPGVHI2DSMVQWIX3LMV43ASLTON2WKOZSGIZDSMRXGIYTONI. You are receiving this because you are subscribed to this thread.Message ID: @.***>
@AndreaGuarracino I've placed the example here: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link
apple.fa and orange.fa are the fake genomes. barcodes_locations.dat gives hints about the embedded homologies. For example, BARCODE_id97_len10000 APPLE + 66985 ORANGE - 8985
indicates that a 10K segment with identity ≈97% is (roughly) centered at 66985 in apple.fa and 8985 in orange.fa (reverse-complemented).
Hi @rsharris,
It looks like the cause of most of the missing mappings is that the colinear chains identified by MashMap are too short to pass the "block-length filter."
-s[N], --segment-length=[N] segment seed length for mapping [default: 5k]
-l[N], --block-length=[N] keep merged mappings supported by homologies of this total
length [default: 5*segment-length]
So in your example, all colinear chains w/ block-length < 25k are being filter out.
Keep in mind that even if you turn the block-length filter off with -l 0
, the default "segment-length" for MashMap is 5k, so if a homology is < 10k, you're still not necessarily guaranteed to find it. For example, in the worst case we could have a reference ABCCD
, where each letter is a 2.5kbp region labeled by the homology group, and a query XCCY
. In MashMap, the query will be split into two segments: XC
and CY
. Both XC
and CY
have a maximum mapping identify of 50%, so no mappings will be reported even though there is a homologous region of 5kbp in the query.
You can set the paramters of wfmash
to be more sensitive by decreasing the segment-length and turning off the block-length filter. You can also skip the alignment step with -m
(this is essentially the same as just running MashMap). Here is what I get if I run the following:
wfmash ~/Sandbox/apple.fa ~/Sandbox/orange.fa -m -l 0 -s 500 --map-pct-id=70
ORANGE 378000 0 14000 - APPLE 378000 62010 74999 34 14000 15 id:f:0.970926 kc:f:0.938369
ORANGE 378000 14000 24000 + APPLE 378000 25988 35999 43 10011 20 id:f:0.990037 kc:f:0.969004
ORANGE 378000 24500 34000 + APPLE 378000 296676 306047 3 9500 8 id:f:0.848691 kc:f:0.960166
ORANGE 378000 34000 36000 + APPLE 378000 252847 255118 2 2271 9 id:f:0.860982 kc:f:1.0003
ORANGE 378000 36000 46000 - APPLE 378000 224015 233929 7 10000 9 id:f:0.880723 kc:f:0.955113
ORANGE 378000 46000 56000 + APPLE 378000 79909 90003 40 10094 14 id:f:0.955574 kc:f:0.949293
ORANGE 378000 56000 66000 - APPLE 378000 151986 162069 9 10083 11 id:f:0.921719 kc:f:0.967208
ORANGE 378000 66000 67000 + APPLE 378000 0 996 67 1000 255 id:f:1 kc:f:0.952487
ORANGE 378000 67000 77000 + APPLE 378000 241880 251824 2 10000 9 id:f:0.878268 kc:f:0.943414
ORANGE 378000 94000 96000 - APPLE 378000 163023 165022 7 2000 10 id:f:0.9097 kc:f:0.961011
ORANGE 378000 96500 101000 - APPLE 378000 309086 313522 1 4500 8 id:f:0.836647 kc:f:0.972789
ORANGE 378000 101000 106000 + APPLE 378000 2994 7997 67 5003 255 id:f:1 kc:f:0.967899
ORANGE 378000 106000 107000 + APPLE 378000 341943 342721 3 1000 8 id:f:0.854658 kc:f:1.22786
ORANGE 378000 107000 108000 - APPLE 378000 72003 73012 21 1009 14 id:f:0.963555 kc:f:0.923736
ORANGE 378000 108000 109000 - APPLE 378000 270020 270792 2 1000 9 id:f:0.860807 kc:f:0.894613
ORANGE 378000 110000 114000 - APPLE 378000 273198 277143 3 4000 9 id:f:0.862408 kc:f:1.01202
ORANGE 378000 114000 119000 + APPLE 378000 74971 79998 32 5027 14 id:f:0.960263 kc:f:0.958075
ORANGE 378000 119000 120000 - APPLE 378000 107995 109015 12 1020 11 id:f:0.924284 kc:f:0.959393
ORANGE 378000 122000 127000 + APPLE 378000 38980 44004 39 5024 17 id:f:0.977686 kc:f:0.950008
ORANGE 378000 127500 132000 + APPLE 378000 327101 331941 2 4840 8 id:f:0.840674 kc:f:0.96156
ORANGE 378000 132000 134000 + APPLE 378000 144925 146985 5 2060 11 id:f:0.9245 kc:f:1.06127
ORANGE 378000 134500 135000 - APPLE 378000 215988 216488 3 500 9 id:f:0.878709 kc:f:0.947062
ORANGE 378000 135000 149000 - APPLE 378000 108957 125988 19 17031 12 id:f:0.931982 kc:f:0.998222
ORANGE 378000 149000 159000 - APPLE 378000 43988 53979 37 10000 17 id:f:0.978063 kc:f:0.925722
ORANGE 378000 159000 169000 - APPLE 378000 7999 18000 67 10001 255 id:f:1 kc:f:0.956995
ORANGE 378000 169000 169500 - APPLE 378000 360451 360951 1 500 8 id:f:0.830608 kc:f:1.06221
ORANGE 378000 170000 172000 - APPLE 378000 234891 237098 8 2207 10 id:f:0.899604 kc:f:0.996349
ORANGE 378000 172000 189000 + APPLE 378000 324908 340813 1 17000 8 id:f:0.839747 kc:f:1.01342
ORANGE 378000 190000 195000 + APPLE 378000 20999 25990 58 5000 20 id:f:0.989183 kc:f:0.966632
ORANGE 378000 195000 196000 - APPLE 378000 89963 91039 6 1076 12 id:f:0.93021 kc:f:0.945178
ORANGE 378000 196000 206000 - APPLE 378000 259956 269944 7 10000 9 id:f:0.871032 kc:f:0.965406
ORANGE 378000 206000 216000 + APPLE 378000 206084 216081 3 10000 9 id:f:0.883693 kc:f:0.940492
ORANGE 378000 216000 226000 - APPLE 378000 169910 180032 8 10122 10 id:f:0.904045 kc:f:0.976999
ORANGE 378000 226000 245000 + APPLE 378000 126008 151963 10 25955 11 id:f:0.92342 kc:f:0.993916
ORANGE 378000 245000 255000 + APPLE 378000 97980 107982 22 10002 13 id:f:0.950799 kc:f:0.980349
ORANGE 378000 255000 257000 - APPLE 378000 216813 219009 7 2196 11 id:f:0.918586 kc:f:0.989899
ORANGE 378000 257000 259000 + APPLE 378000 198999 200995 2 2000 9 id:f:0.885078 kc:f:1.0498
ORANGE 378000 270500 275000 - APPLE 378000 291088 295361 4 4500 9 id:f:0.882286 kc:f:0.973189
ORANGE 378000 275000 276000 - APPLE 378000 162002 162828 5 1000 9 id:f:0.881062 kc:f:1.07553
ORANGE 378000 276000 281000 + APPLE 378000 183035 187755 5 5000 10 id:f:0.908737 kc:f:0.962666
ORANGE 378000 281000 283000 + APPLE 378000 90990 93014 24 2024 14 id:f:0.955881 kc:f:1.01235
ORANGE 378000 283000 285000 + APPLE 378000 342590 345120 1 2530 8 id:f:0.845707 kc:f:1.08461
ORANGE 378000 285000 286000 - APPLE 378000 17997 19000 40 1003 20 id:f:0.989152 kc:f:0.898151
ORANGE 378000 286000 291000 + APPLE 378000 128986 134007 18 5021 12 id:f:0.929507 kc:f:1.0129
ORANGE 378000 291000 297000 - APPLE 378000 371941 377760 4 6000 9 id:f:0.866848 kc:f:0.942814
ORANGE 378000 301000 317500 - APPLE 378000 252520 259949 2 16500 9 id:f:0.885796 kc:f:1.01906
ORANGE 378000 318000 327500 + APPLE 378000 350025 359080 3 9500 8 id:f:0.851078 kc:f:0.96004
ORANGE 378000 328500 330000 - APPLE 378000 271023 272514 1 1500 8 id:f:0.840674 kc:f:1.01111
ORANGE 378000 330500 331500 + APPLE 378000 289600 290511 4 1000 10 id:f:0.891449 kc:f:0.962764
ORANGE 378000 332000 342000 - APPLE 378000 133873 144013 10 10140 12 id:f:0.933459 kc:f:0.960952
ORANGE 378000 342000 347000 - APPLE 378000 237064 242036 2 5000 9 id:f:0.873489 kc:f:0.978416
ORANGE 378000 347000 352000 + APPLE 378000 219132 224053 4 5000 10 id:f:0.888701 kc:f:1.02499
ORANGE 378000 352000 353000 + APPLE 378000 179999 181014 17 1015 13 id:f:0.948037 kc:f:0.944139
ORANGE 378000 353000 354000 + APPLE 378000 54005 55009 24 1004 15 id:f:0.970361 kc:f:0.884885
ORANGE 378000 354000 359000 + APPLE 378000 110997 116000 16 5003 13 id:f:0.948956 kc:f:0.973016
ORANGE 378000 359500 368000 + APPLE 378000 278486 287219 2 8733 9 id:f:0.859774 kc:f:1.02027
ORANGE 378000 369000 372000 + APPLE 378000 363170 365694 3 3000 8 id:f:0.842633 kc:f:0.933087
ORANGE 378000 374000 375000 + APPLE 378000 35969 36980 20 1011 15 id:f:0.969148 kc:f:1.08341
ORANGE 378000 375000 376000 - APPLE 378000 287936 288818 1 1000 8 id:f:0.845707 kc:f:0.950807
ORANGE 378000 377000 377500 + APPLE 378000 233937 234437 8 500 11 id:f:0.921907 kc:f:0.949565
Thanks, @bkille , that's helpful, and it's the kind of insight I'm trying to gather, from a user perspective. I'm running some additional experiments with randoms and expect to have some more observations in later this afternoon.
Based on my (admittedly very limited) experiments, I think the statement "wfmash is designed ... whole genome alignment ... can handle ... average nucleotide identity ... as low as 70%" is misleading.
As currently written (in the README), a naive interpretation is wfmash it will discover all homologies with at ANI≥70%. But in reality --map-pct-id=70 just sets the ANI threshold for the mashmap step. And my simple experiments appear to show the discovery threshold for that step is somewhat higher than 70%.
I ran the same experiment as earlier, except I allowed identity to range from 70% to 100%. I ran wfmash 20240406E.apple.fa 20240406E.orange.fa --map-pct-id=70 --segment-length=500 --block-length=0
. Most of the induced homologies below 80% ANI went undiscovered. (In more detail, there we 40 such homologies, only 9 were discovered).
Probably if the user is familiar with mashmap they would understand these limitations (or how to tweak the params to avoid them). But this doesn't seem very clear from the current README. I get the sense, now, that my expectations were higher than were warranted, having interpreted that paragraph as "fast whole genome alignment with identity as low as 70%" and ignoring the rest of that section.
For what it's worth, the new experiment's data is in the same place as before: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link
Thanks for sharing the data. That's a fair point that the README right now could use some more detail/instruction. Arguably, the bigger issue here is that I haven't set up the k-mer size to be automatically adjusted based on the minimum-identity threshold. Right now, only the sketch-size is adjusted automatically.
As far as the test dataset goes, I would say that this one is particularly challenging even for 70% ANI due to the lack of synteny 😅 FWIW, by dropping the kmer size to 13 it looks like most of the homologies are recovered (and with much more accurate ANI predictions).
Also, this helped me notice that we should likely be adjusting the "chaining gap" to by dynamic w/ the size of the segments, so thank you for that as well!
Regarding lack of synteny — it certainly is a contrived dataset. I intentionally made sure none of the 'planted' homologies had the same neighbors in both sequences. My interest was in seeing how well each homology would be discovered on its own. Would real data ever look like that? Probably not. But I do recall drosophila having a high degree of rearrangement, but perhaps not with such short pieces.
Probably should change the title of this issue to something more meaningful.
Regarding lack of synteny — it certainly is a contrived dataset. I intentionally made sure none of the 'planted' homologies had the same neighbors in both sequences. My interest was in seeing how well each homology would be discovered on its own. Would real data ever look like that? Probably not. But I do recall drosophila having a high degree of rearrangement, but perhaps not with such short pieces.
Contrived or not, it helped bring to light/remind me of some issues that I need to address with MashMap integration before the "1.0.0" release, so thank you for that.
I am new to wfmash (but of course have plenty of alignment experience).
I tried an experiment in which I generated a random genome and a second with rearrangements (including random reversals) of items from the first with various lengths (from 1K to 10K) and divergences between 0 and 20%. No indels (other than the rearrangements). There are no duplications — and every bit of each genome is homologous to a unique bit of the other.
I tried aligning using
wfmash apple.fa orange.fa --map-pct-id=70
and no alignments at all were reported. Total genome length was ≈400K.I should note that I built wfmash from a clone of just a couple days ago (v0.13.0-3-gc18520b), and that I have run a few other experiments mapping similarly diverged items, but as separate reads, to the random genome, and I do get alignments in that case (though not as many as I might expect).
I notice in issue #161 an Oct/24/2022 post that shows a dot plot of a whole genome alignment that looks mostly syntenic. Which makes me wonder if wfmash assumes end-to-end synteny.
More generally I'm trying to understand how I need to parameterize wfmash to find all the homologies in my sequences, and to understand what its limitations are w.r.t. divergence and lengths.