lh3 / minigraph

Sequence-to-graph mapper and graph generator
https://lh3.github.io/minigraph
MIT License
405 stars 41 forks source link

Valid paths in a GFA don't have end-to-end alignments #103

Open rlorigro opened 1 year ago

rlorigro commented 1 year ago

Hi,

I am cross-posting this issue which I have submitted to the PanAligner repository, since it has been found by the authors to be a result of the base-level alignment step taken from minigraph/gwfa.

In summary, if I take a simple 3 node graph in which the middle node could be viewed as an insertion, and I concatenate the three nodes as a query, minigraph fails to find an end-to-end alignment. The shortest node is 38bp.

Thanks

PS: to clarify I've run the exact same test with minigraph and found the same results

rlorigro commented 1 month ago

Just reposting the details here for better visibility. This seems like a pretty serious bug to me.

Here is the GFA:

S   1   tcatggtacttgtcaccttctgtcctatttgagtcatgatgcatttgtctatcatctatctccctgcaatgaacatccctggggtggggccagggtctgttttgcttggtgctgcattcccattgccaagaagacctcacaggcatatcaagcacacagcaggtgcataacacttgctgagtttgctgaatgaatCCCTCCTCTGTATCCTACACATCCAACTGGTCACCACGTCCTGTGCGGTAGATGTCCTTTTTAGCAGGTCCCCTGCACTTCACTGTGTTTGTTAAGCCTTGGCTTGCTCTGtcatttcattttgcttaggTGGCGGCAGCAGCCCCTGACAGCCCCTGGCCTCACAGCCCCTGGCCTCAACCCATATCCTCTCCACTCTGCACACAGCTGCTGGGAGACTTTTCCTTAAAACTtggtttctctcctttcccttctgaAAACTGTCAGTGGCTTCCTGTGGCTCAGAGGAATCAATCGAAACATGGTGCTGCTCACAAGGAAACACAACGTTTCTAGGGCCAGTGCTGTCTTTGTCCTGCCTTCCTCCAAATTCAGCCACGTTAAGCCTCAAGCAGATCCTCCCACACAGCTGGGCTTCTTCTGCCTTTCTGCCAGGACTTGCTGCTCCCAGGAGCCTCCTCCCCCAGTCCCTGCAGCCCCAGGTCAGGTGTCTCCCCACCCAGCCTGGCTCAGCTGTCCTTTCCCGTGCAACTACAGGTTCATACAATTTGCTATTGCTCCATCTACTCTATGTGACTCTTGGTTTCTTGAAGGCAGGTATGggactctttctttccttctttgctttcttttctctttctctttctttctttttctttctttctttctttctttc
S   2   tttctttctttctttctttctttctttctttctttctt
S   3   tctctctctctctctctcttttcttttctttctttctctctttctgtttttagagactgggtcttgctgtcacccaggttggagtgcagtggtgcaatcatggctcaccgaactcctgggctcaagtgagcctcttgcctcagcctcccaattagttgggactacaggcatgtgccactacacctggctaattgttattattattattattattattattatttttgtaaagacagggtcttgctctgtttcctaggctggtcttgaacccctggcctcaaatgatcctcctgcctcagcctcccaaagtgctgggattccaggagtaagataccatgtctggccTGAGAAATTTTCTAAAAGGCATGTTTTTGCACTGtccaatatgtttttctttttatcccagcctctagcacaagtgcctgactcgaacgaagcagggactctgaaatatcttgaatgaacaagtAAATGGCACTTGAATAGGTGGCCTCTAATGtgcagaggaaggaaaagggagctgacgctttccgagtgttcagtaccttctaggcaccatgctgggccttttatgtagttctcgatgaatcctcatggcatcccGCTTTGCAGAATGATTCTAGTTACCCAGAGAGCAAGTAGCCCTTAGGGTCAGGACATGCATCTGAGTTGGTCTAGTCCCAAAGCTCCTGGTCCTCTCCTGACATCActtaaacagagaaacagaactccCCTTGGGCTTCTAGGGGCGCTGGGTTCAGGAGGCACAGCCACTCCCTTTGTTCTTCCTGGCAGCTGCCCCACCAGCAGTGAGCCCATCCCacctctgggttttcttttt
L   1   +   2   +   0M
L   1   +   3   +   0M
L   2   +   3   +   0M

When I concatenate 1,2,3 or 1,3 and align them to the GFA, I never get an end-to-end alignment, and node 2 is always missing:

1       871     1       871     +       >1      871     1       871     870     870     60      tp:A:P  NM:i:0  cm:i:51 s1:i:714        s2:i:0  dv:f:0.0756     cg:Z:870=
3       841     5       836     +       >3      841     5       836     831     831     60      tp:A:P  NM:i:0  cm:i:48 s1:i:672        s2:i:0  dv:f:0.0790     cg:Z:831=
1_2_3   1750    1       857     +       >1      871     1       857     856     856     60      tp:A:P  NM:i:0  cm:i:50 s1:i:714        s2:i:0  dv:f:0.0751     cg:Z:856=
1_2_3   1750    930     1745    +       >3      841     21      836     815     815     60      tp:A:P  NM:i:0  cm:i:47 s1:i:671        s2:i:0  dv:f:0.0795     cg:Z:815=
1_3     1712    1       857     +       >1      871     1       857     856     856     60      tp:A:P  NM:i:0  cm:i:50 s1:i:700        s2:i:0  dv:f:0.0751     cg:Z:856=
1_3     1712    892     1707    +       >3      841     21      836     815     815     60      tp:A:P  NM:i:0  cm:i:47 s1:i:672        s2:i:0  dv:f:0.0795     cg:Z:815=

Here is a visualization showing how neither the graph nor the linear query are fully covered: image

I have tried multiple different combinations of parameters to attempt to find seeds/minimizers in node 2:

Defaults:

PanAligner -c -x lr -o /tmp/tmpn2oq6iyx/reads_vs_graph.gaf bad_tandem_graph.gfa bad_tandem_nodes.fasta
/tmp/tmpn2oq6iyx/reads_vs_graph.gaf

Permissive:

PanAligner -c -k 14 -n 1,1 -m 1,1 -x lr -o /tmp/tmpyid87tic/reads_vs_graph.gaf bad_tandem_graph.gfa bad_tandem_nodes.fasta
/tmp/tmpyid87tic/reads_vs_graph.gaf

I've also tried increasing the -g and -r parameters to allow the extension step to find more alignments, but that does not seem to have any effect. Alternatively, extreme 0/1 values of -f also have no effect.

Could you help me with this case? I am having difficulty understanding why the base level alignment step doesn't bridge the chains here. Ideally, I would use base level alignment for a majority of the work, and only rely on minimizers for anchoring the flanks of the tandem repeat.

Thanks