maickrau / GraphAligner

MIT License

Custom seeds in GAF format #101

Closed nrizzo closed 4 days ago

nrizzo commented 6 months ago

Hello!

Are custom seeds in GAF format supported when the input graph is in GFA format? From snooping around the source code and from testing, it seems that the hidden option --realign does exactly this, but I wonder whether there are any gotchas to be aware of.

The only minor issue I've found is that GraphAligner (daec67f) expects seeds to have a length of at least 2 (but it's not really a problem):

# seed read\t71\t0\t2\t+\t>1\t44\t3\t5\t0\t0\t255 works, but this one does not:
$ GraphAligner -b 10 -g test/graph.gfa -f test/read.fa -a test/aln.gaf --realign <(echo -e "read\t71\t0\t1\t+\t>1\t44\t3\t4\t0\t0\t255")
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
GraphAligner Branch master commit daec67f67a2f50d648a6aa30cbbe5a2949583061 2024-01-19 10:52:13 +0200
Load graph from test/graph.gfa
Build alignment graph
Seeds from file
Seed cluster size 1
Extend up to 5 seed clusters
Alignment bandwidth 10
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
Backtrace from 10 highest scoring local maxima per cluster
write alignments to test/aln.gaf
Align
src/GraphAligner.h:179: Assertion 'thisEnd > thisStart' failed. Read: read. Seed: 0+,0,0,0
Alignment finished
Input reads: 1 (71bp)
Seeds found: 0
Seeds extended: 0
Reads with a seed: 0 (0bp)
Reads with an alignment: 0 (0bp)
Alignments: 0 (0bp)
End-to-end alignments: 0 (0bp)
Alignment broke with some reads. Look at stderr output.
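
As a workaround sketch, seeds shorter than 2 could be filtered out before the file is passed to --realign. The snippet below assumes the seed file uses the standard 12 tab-separated GAF columns (query start/end in columns 3 and 4, path start/end in columns 8 and 9). The names `seed_lengths`, `usable_seeds` and `MIN_SEED_LENGTH` are made up for illustration, and the threshold of 2 comes only from the behaviour observed above, not from any documented limit.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: threshold taken from the observed assertion failure, not from documentation.
MIN_SEED_LENGTH = 2


def seed_lengths(gaf_line: str) -> Tuple[int, int]:
    """Return (query span, path span) of one GAF record."""
    fields = gaf_line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated GAF columns")
    query_start, query_end = int(fields[2]), int(fields[3])
    path_start, path_end = int(fields[7]), int(fields[8])
    return query_end - query_start, path_end - path_start


def usable_seeds(lines: Iterable[str],
                 min_length: int = MIN_SEED_LENGTH) -> Iterator[str]:
    """Yield only the records whose query and path spans reach min_length."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_span, path_span = seed_lengths(line)
        if query_span >= min_length and path_span >= min_length:
            yield line


if __name__ == "__main__":
    working = "read\t71\t0\t2\t+\t>1\t44\t3\t5\t0\t0\t255"
    failing = "read\t71\t0\t1\t+\t>1\t44\t3\t4\t0\t0\t255"
    for seed in usable_seeds([working, failing]):
        print(seed)
```

Run on the two seeds above, this keeps the first and drops the second.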

Thanks! ~Nicola

maickrau commented 3 weeks ago

Hi, the hidden option --realign should indeed do this. The output is not constrained to respect the original input alignments in any way: the option only creates seed hits for the first and last matching base pairs of the input alignments and then lets the rest of the alignment method run. The output might have a different number of alignments (e.g. by discarding alignments as secondary), and the output alignments might be inconsistent with the input alignments, e.g. taking different nodes in the parts of the read covered by both the input and the output alignment. The option is hidden because it was meant for debugging and development.
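
Since the output may diverge from the input, a quick check of how far it diverged can be useful. The following is a minimal sketch, not part of GraphAligner: it assumes both files are plain 12-column GAF with paths written as oriented node ids (such as `>1<2>3`), and the helper names are hypothetical. It compares whole-read node sets only, which is coarser than looking at the read intervals covered by both alignments.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# One oriented step of a GAF path, e.g. ">1" or "<2".
NODE_STEP = re.compile(r"[<>][^<>]+")


def path_nodes(path: str) -> List[str]:
    """Split a GAF path column into its oriented node steps."""
    steps = NODE_STEP.findall(path)
    return steps if steps else [path]  # fall back to the raw path name


def alignments_by_read(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Group the node paths of all GAF records by read name."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete GAF record
        grouped[fields[0]].append(path_nodes(fields[5]))
    return dict(grouped)


def summarize_differences(input_lines: Iterable[str],
                          output_lines: Iterable[str]) -> Dict[str, dict]:
    """Per read: alignment counts and nodes seen on only one side."""
    before = alignments_by_read(input_lines)
    after = alignments_by_read(output_lines)
    report = {}
    for read in sorted(set(before) | set(after)):
        nodes_before = {n for path in before.get(read, []) for n in path}
        nodes_after = {n for path in after.get(read, []) for n in path}
        report[read] = {
            "input_alignments": len(before.get(read, [])),
            "output_alignments": len(after.get(read, [])),
            "nodes_only_in_input": sorted(nodes_before - nodes_after),
            "nodes_only_in_output": sorted(nodes_after - nodes_before),
        }
    return report
```

Calling `summarize_differences(open("in.gaf"), open("out.gaf"))` (file names are placeholders) returns one entry per read. A read whose alignment counts differ, or whose node lists are non-empty, is one where the realigned output did not follow the input.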