waveygang / wfmash

base-accurate DNA sequence alignments using WFA and mashmap3
MIT License
177 stars 19 forks source link

Is linear synteny assumed? #233

Open rsharris opened 7 months ago

rsharris commented 7 months ago

I am new to wfmash (but of course have plenty of alignment experience).

I tried an experiment in which I generated a random genome and a second with rearrangements (including random reversals) of items from the first with various lengths (from 1K to 10K) and divergences between 0 and 20%. No indels (other than the rearrangements). There are no duplications — and every bit of each genome is homologous to a unique bit of the other.

I tried aligning using wfmash apple.fa orange.fa --map-pct-id=70 and no alignments at all were reported. Total genome length was ≈400K.

I should note that I built wfmash from a clone of just a couple days ago (v0.13.0-3-gc18520b), and that I have run a few other experiments mapping similarly diverged items, but as separate reads, to the random genome, and I do get alignments in that case (though not as many as I might expect).

I notice in issue #161 an Oct/24/2022 post that shows a dot plot of a whole genome alignment that looks mostly syntenic. Which makes me wonder if wfmash assumes end-to-end synteny.

More generally I'm trying to understand how I need to parameterize wfmash to find all the homologies in my sequences, and to understand what its limitations are w.r.t. divergence and lengths.

AndreaGuarracino commented 7 months ago

Hi Bob, could you share you example?

wfmash first runs a similarity search (with MashMap3) to find similar regions between the input sequences, and then apply end-to-end alignment (hierarchical wavefront alignment) only to those.

Sent from Outlook for Androidhttps://aka.ms/AAb9ysg


From: Bob Harris @.> Sent: Saturday, April 6, 2024 10:26:36 AM To: waveygang/wfmash @.> Cc: Subscribed @.***> Subject: [waveygang/wfmash] Is linear synteny assumed? (Issue #233)

I am new to wfmash (but of course have plenty of alignment experience).

I tried an experiment in which I generated a random genome and a second with rearrangements (including random reversals) of items from the first with various lengths (from 1K to 10K) and divergences between 0 and 20%. No indels (other than the rearrangements). There are no duplications ― and every bit of each genome is homologous to a unique bit of the other.

I tried aligning using wfmash apple.fa orange.fa --map-pct-id=70 and no alignments at all were reported. Total genome length was ≈400K.

I should note that I built wfmash from a clone of just a couple days ago (v0.13.0-3-gc18520b), and that I have run a few other experiments mapping similarly diverged items, but as separate reads, to the random genome, and I do get alignments in that case (though not as many as I might expect).

I notice in issue #161https://github.com/waveygang/wfmash/issues/161 an Oct/24/2022 post that shows a dot plot of a whole genome alignment that looks mostly syntenic. Which makes me wonder if wfmash assumes end-to-end synteny.

More generally I'm trying to understand how I need to parameterize wfmash to find all the homologies in my sequences, and to understand what its limitations are w.r.t. divergence and lengths.

― Reply to this email directly, view it on GitHubhttps://github.com/waveygang/wfmash/issues/233, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHQCKBAAK33G436JYZTY4AAZZAVCNFSM6AAAAABF2PSZPGVHI2DSMVQWIX3LMV43ASLTON2WKOZSGIZDSMRXGIYTONI. You are receiving this because you are subscribed to this thread.Message ID: @.***>

rsharris commented 7 months ago

@AndreaGuarracino I've placed the example here: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link

apple.fa and orange.fa are the fake genomes. barcodes_locations.dat gives hints about the embedded homologies. For example, BARCODE_id97_len10000 APPLE + 66985 ORANGE - 8985 indicates that a 10K segment with identity ≈97% is (roughly) centered at 66985 in apple.fa and 8985 in orange.fa (reverse-complemented).

bkille commented 7 months ago

Hi @rsharris,

It looks like the cause of most of the missing mappings is that the colinear chains identified by MashMap are too short to pass the "block-length filter."

        -s[N], --segment-length=[N]       segment seed length for mapping [default: 5k]
        -l[N], --block-length=[N]         keep merged mappings supported by homologies of this total
                                          length [default: 5*segment-length]

So in your example, all colinear chains w/ block-length < 25k are being filter out.

Keep in mind that even if you turn the block-length filter off with -l 0, the default "segment-length" for MashMap is 5k, so if a homology is < 10k, you're still not necessarily guaranteed to find it. For example, in the worst case we could have a reference ABCCD, where each letter is a 2.5kbp region labeled by the homology group, and a query XCCY. In MashMap, the query will be split into two segments: XC and CY. Both XC and CY have a maximum mapping identify of 50%, so no mappings will be reported even though there is a homologous region of 5kbp in the query.

You can set the paramters of wfmash to be more sensitive by decreasing the segment-length and turning off the block-length filter. You can also skip the alignment step with -m (this is essentially the same as just running MashMap). Here is what I get if I run the following:

wfmash ~/Sandbox/apple.fa ~/Sandbox/orange.fa -m -l 0 -s 500 --map-pct-id=70
ORANGE  378000  0       14000   -       APPLE   378000  62010   74999   34      14000   15      id:f:0.970926   kc:f:0.938369
ORANGE  378000  14000   24000   +       APPLE   378000  25988   35999   43      10011   20      id:f:0.990037   kc:f:0.969004
ORANGE  378000  24500   34000   +       APPLE   378000  296676  306047  3       9500    8       id:f:0.848691   kc:f:0.960166
ORANGE  378000  34000   36000   +       APPLE   378000  252847  255118  2       2271    9       id:f:0.860982   kc:f:1.0003
ORANGE  378000  36000   46000   -       APPLE   378000  224015  233929  7       10000   9       id:f:0.880723   kc:f:0.955113
ORANGE  378000  46000   56000   +       APPLE   378000  79909   90003   40      10094   14      id:f:0.955574   kc:f:0.949293
ORANGE  378000  56000   66000   -       APPLE   378000  151986  162069  9       10083   11      id:f:0.921719   kc:f:0.967208
ORANGE  378000  66000   67000   +       APPLE   378000  0       996     67      1000    255     id:f:1  kc:f:0.952487
ORANGE  378000  67000   77000   +       APPLE   378000  241880  251824  2       10000   9       id:f:0.878268   kc:f:0.943414
ORANGE  378000  94000   96000   -       APPLE   378000  163023  165022  7       2000    10      id:f:0.9097     kc:f:0.961011
ORANGE  378000  96500   101000  -       APPLE   378000  309086  313522  1       4500    8       id:f:0.836647   kc:f:0.972789
ORANGE  378000  101000  106000  +       APPLE   378000  2994    7997    67      5003    255     id:f:1  kc:f:0.967899
ORANGE  378000  106000  107000  +       APPLE   378000  341943  342721  3       1000    8       id:f:0.854658   kc:f:1.22786
ORANGE  378000  107000  108000  -       APPLE   378000  72003   73012   21      1009    14      id:f:0.963555   kc:f:0.923736
ORANGE  378000  108000  109000  -       APPLE   378000  270020  270792  2       1000    9       id:f:0.860807   kc:f:0.894613
ORANGE  378000  110000  114000  -       APPLE   378000  273198  277143  3       4000    9       id:f:0.862408   kc:f:1.01202
ORANGE  378000  114000  119000  +       APPLE   378000  74971   79998   32      5027    14      id:f:0.960263   kc:f:0.958075
ORANGE  378000  119000  120000  -       APPLE   378000  107995  109015  12      1020    11      id:f:0.924284   kc:f:0.959393
ORANGE  378000  122000  127000  +       APPLE   378000  38980   44004   39      5024    17      id:f:0.977686   kc:f:0.950008
ORANGE  378000  127500  132000  +       APPLE   378000  327101  331941  2       4840    8       id:f:0.840674   kc:f:0.96156
ORANGE  378000  132000  134000  +       APPLE   378000  144925  146985  5       2060    11      id:f:0.9245     kc:f:1.06127
ORANGE  378000  134500  135000  -       APPLE   378000  215988  216488  3       500     9       id:f:0.878709   kc:f:0.947062
ORANGE  378000  135000  149000  -       APPLE   378000  108957  125988  19      17031   12      id:f:0.931982   kc:f:0.998222
ORANGE  378000  149000  159000  -       APPLE   378000  43988   53979   37      10000   17      id:f:0.978063   kc:f:0.925722
ORANGE  378000  159000  169000  -       APPLE   378000  7999    18000   67      10001   255     id:f:1  kc:f:0.956995
ORANGE  378000  169000  169500  -       APPLE   378000  360451  360951  1       500     8       id:f:0.830608   kc:f:1.06221
ORANGE  378000  170000  172000  -       APPLE   378000  234891  237098  8       2207    10      id:f:0.899604   kc:f:0.996349
ORANGE  378000  172000  189000  +       APPLE   378000  324908  340813  1       17000   8       id:f:0.839747   kc:f:1.01342
ORANGE  378000  190000  195000  +       APPLE   378000  20999   25990   58      5000    20      id:f:0.989183   kc:f:0.966632
ORANGE  378000  195000  196000  -       APPLE   378000  89963   91039   6       1076    12      id:f:0.93021    kc:f:0.945178
ORANGE  378000  196000  206000  -       APPLE   378000  259956  269944  7       10000   9       id:f:0.871032   kc:f:0.965406
ORANGE  378000  206000  216000  +       APPLE   378000  206084  216081  3       10000   9       id:f:0.883693   kc:f:0.940492
ORANGE  378000  216000  226000  -       APPLE   378000  169910  180032  8       10122   10      id:f:0.904045   kc:f:0.976999
ORANGE  378000  226000  245000  +       APPLE   378000  126008  151963  10      25955   11      id:f:0.92342    kc:f:0.993916
ORANGE  378000  245000  255000  +       APPLE   378000  97980   107982  22      10002   13      id:f:0.950799   kc:f:0.980349
ORANGE  378000  255000  257000  -       APPLE   378000  216813  219009  7       2196    11      id:f:0.918586   kc:f:0.989899
ORANGE  378000  257000  259000  +       APPLE   378000  198999  200995  2       2000    9       id:f:0.885078   kc:f:1.0498
ORANGE  378000  270500  275000  -       APPLE   378000  291088  295361  4       4500    9       id:f:0.882286   kc:f:0.973189
ORANGE  378000  275000  276000  -       APPLE   378000  162002  162828  5       1000    9       id:f:0.881062   kc:f:1.07553
ORANGE  378000  276000  281000  +       APPLE   378000  183035  187755  5       5000    10      id:f:0.908737   kc:f:0.962666
ORANGE  378000  281000  283000  +       APPLE   378000  90990   93014   24      2024    14      id:f:0.955881   kc:f:1.01235
ORANGE  378000  283000  285000  +       APPLE   378000  342590  345120  1       2530    8       id:f:0.845707   kc:f:1.08461
ORANGE  378000  285000  286000  -       APPLE   378000  17997   19000   40      1003    20      id:f:0.989152   kc:f:0.898151
ORANGE  378000  286000  291000  +       APPLE   378000  128986  134007  18      5021    12      id:f:0.929507   kc:f:1.0129
ORANGE  378000  291000  297000  -       APPLE   378000  371941  377760  4       6000    9       id:f:0.866848   kc:f:0.942814
ORANGE  378000  301000  317500  -       APPLE   378000  252520  259949  2       16500   9       id:f:0.885796   kc:f:1.01906
ORANGE  378000  318000  327500  +       APPLE   378000  350025  359080  3       9500    8       id:f:0.851078   kc:f:0.96004
ORANGE  378000  328500  330000  -       APPLE   378000  271023  272514  1       1500    8       id:f:0.840674   kc:f:1.01111
ORANGE  378000  330500  331500  +       APPLE   378000  289600  290511  4       1000    10      id:f:0.891449   kc:f:0.962764
ORANGE  378000  332000  342000  -       APPLE   378000  133873  144013  10      10140   12      id:f:0.933459   kc:f:0.960952
ORANGE  378000  342000  347000  -       APPLE   378000  237064  242036  2       5000    9       id:f:0.873489   kc:f:0.978416
ORANGE  378000  347000  352000  +       APPLE   378000  219132  224053  4       5000    10      id:f:0.888701   kc:f:1.02499
ORANGE  378000  352000  353000  +       APPLE   378000  179999  181014  17      1015    13      id:f:0.948037   kc:f:0.944139
ORANGE  378000  353000  354000  +       APPLE   378000  54005   55009   24      1004    15      id:f:0.970361   kc:f:0.884885
ORANGE  378000  354000  359000  +       APPLE   378000  110997  116000  16      5003    13      id:f:0.948956   kc:f:0.973016
ORANGE  378000  359500  368000  +       APPLE   378000  278486  287219  2       8733    9       id:f:0.859774   kc:f:1.02027
ORANGE  378000  369000  372000  +       APPLE   378000  363170  365694  3       3000    8       id:f:0.842633   kc:f:0.933087
ORANGE  378000  374000  375000  +       APPLE   378000  35969   36980   20      1011    15      id:f:0.969148   kc:f:1.08341
ORANGE  378000  375000  376000  -       APPLE   378000  287936  288818  1       1000    8       id:f:0.845707   kc:f:0.950807
ORANGE  378000  377000  377500  +       APPLE   378000  233937  234437  8       500     11      id:f:0.921907   kc:f:0.949565
rsharris commented 7 months ago

Thanks, @bkille , that's helpful, and it's the kind of insight I'm trying to gather, from a user perspective. I'm running some additional experiments with randoms and expect to have some more observations in later this afternoon.

rsharris commented 7 months ago

Based on my (admittedly very limited) experiments, I think the statement "wfmash is designed ... whole genome alignment ... can handle ... average nucleotide identity ... as low as 70%" is misleading.

As currently written (in the README), a naive interpretation is wfmash it will discover all homologies with at ANI≥70%. But in reality --map-pct-id=70 just sets the ANI threshold for the mashmap step. And my simple experiments appear to show the discovery threshold for that step is somewhat higher than 70%.

I ran the same experiment as earlier, except I allowed identity to range from 70% to 100%. I ran wfmash 20240406E.apple.fa 20240406E.orange.fa --map-pct-id=70 --segment-length=500 --block-length=0. Most of the induced homologies below 80% ANI went undiscovered. (In more detail, there we 40 such homologies, only 9 were discovered).

Probably if the user is familiar with mashmap they would understand these limitations (or how to tweak the params to avoid them). But this doesn't seem very clear from the current README. I get the sense, now, that my expectations were higher than were warranted, having interpreted that paragraph as "fast whole genome alignment with identity as low as 70%" and ignoring the rest of that section.

For what it's worth, the new experiment's data is in the same place as before: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link

bkille commented 7 months ago

Thanks for sharing the data. That's a fair point that the README right now could use some more detail/instruction. Arguably, the bigger issue here is that I haven't set up the k-mer size to be automatically adjusted based on the minimum-identity threshold. Right now, only the sketch-size is adjusted automatically.

As far as the test dataset goes, I would say that this one is particularly challenging even for 70% ANI due to the lack of synteny 😅 FWIW, by dropping the kmer size to 13 it looks like most of the homologies are recovered (and with much more accurate ANI predictions).

Also, this helped me notice that we should likely be adjusting the "chaining gap" to by dynamic w/ the size of the segments, so thank you for that as well!

rsharris commented 7 months ago

Regarding lack of synteny — it certainly is a contrived dataset. I intentionally made sure none of the 'planted' homologies had the same neighbors in both sequences. My interest was in seeing how well each homology would be discovered on its own. Would real data ever look like that? Probably not. But I do recall drosophila having a high degree of rearrangement, but perhaps not with such short pieces.

Probably should change the title of this issue to something more meaningful.

bkille commented 7 months ago

Regarding lack of synteny — it certainly is a contrived dataset. I intentionally made sure none of the 'planted' homologies had the same neighbors in both sequences. My interest was in seeing how well each homology would be discovered on its own. Would real data ever look like that? Probably not. But I do recall drosophila having a high degree of rearrangement, but perhaps not with such short pieces.

Contrived or not, it helped bring to light/remind me of some issues that I need to address with MashMap integration before the "1.0.0" release, so thank you for that.