Closed colindaven closed 1 year ago
The fact that it outputs alignments is expected, but you should be very careful about trusting those alignments to be correct. In particular regions with a lot of short (less than 30bp) nodes will be systematically underaligned. It's plausible that your graph might not have such problem regions since there's just three genomes, but scaling it up to more genomes will inevitably introduce those kinds of areas. You might want to do some validation before trusting the results, eg. do the reads from one of the three genomes align to nodes in that genome, and do they cover that genome as expected. Since there's a lot of alignments (more than twice the reads) I suspect the results might have some issues.
You can filter the gaf to full length alignments with awk '$3==0 && $4==$2'
, you might also want to filter by mapping quality with awk '$3==0 && $4==$2 && $12>20'
for minimum mapq 20 full length alignments.
Thanks for that guidance, I'll look into the gaf format more thoroughly now.
Your suspicion was warranted - if my rust script is correct, then there were very few end-to-end alignments for these short reads, but MQ (set to 20) was pretty good:
**Results of gaf-filter-rs on SRR2183387_Arabidopsis_1-8_25k_R1.gaf**
* very few end to end aligns
GAF line count:15161
End to end count:85
MQ_pass_counter count:12837
Hi,
I know GraphAligner is a nice tool for long read mapping, but given problems with speed or compatibility with other tools in the graph genomics world, I tried it for short Illumina reads (150bp).
Surprisingly, it worked (I think).
Q1: is this expected ? These are Arabidopsis reads and a small PGGB pangenome of 3 Arabidopsis chromosome 1s.
Q2: is there a way to force keeping only end-to-end alignments ? GAM filtering is difficult / impossible, GAF is probably easier, and I'm running into vg surject errors at the next step (which could have all sorts of reasons, its probably the pangenome).
Thanks