maickrau / GraphAligner

MIT License
255 stars 30 forks source link

short reads / end to end #79

Closed colindaven closed 1 year ago

colindaven commented 1 year ago

Hi,

I know GraphAligner is a nice tool for long read mapping, but given problems with speed or compatibility with other tools in the graph genomics world, I tried it for short Illumina reads (150bp).

Surprisingly, it worked (I think).

Q1: is this expected ? These are Arabidopsis reads and a small PGGB pangenome of 3 Arabidopsis chromosome 1s.

Q2: is there a way to force keeping only end-to-end alignments ? GAM filtering is difficult / impossible, GAF is probably easier, and I'm running into vg surject errors at the next step (which could have all sorts of reasons, its probably the pangenome).

Thanks


Alignment finished
Input reads: 25000 (2500000bp)
Seeds found: 544641
Seeds extended: 99650
Reads with a seed: 23816 (2381600bp)
Reads with an alignment: 23740 (1646863bp)
Alignments: 68662 (3666323bp) (41028 additional alignments discarded)
End-to-end alignments: 22214 (2221400bp)

Gam stats

SRR2183387_Arabidopsis_1-8_25k_R1.gam.stats.txt
Total alignments: 68662
Total primary: 68662
Total secondary: 0
Total aligned: 63661
Total perfect: 0
Total gapless (softclips allowed): 45785
Total paired: 0
Total properly paired: 0
Insertions: 21343 bp in 19272 read events
Deletions: 8930 bp in 7738 read events
Substitutions: 224147 bp in 200096 read events
Softclips: 0 bp in 0 read events
maickrau commented 1 year ago
  1. The fact that it outputs alignments is expected, but you should be very careful about trusting those alignments to be correct. In particular regions with a lot of short (less than 30bp) nodes will be systematically underaligned. It's plausible that your graph might not have such problem regions since there's just three genomes, but scaling it up to more genomes will inevitably introduce those kinds of areas. You might want to do some validation before trusting the results, eg. do the reads from one of the three genomes align to nodes in that genome, and do they cover that genome as expected. Since there's a lot of alignments (more than twice the reads) I suspect the results might have some issues.

  2. You can filter the gaf to full length alignments with awk '$3==0 && $4==$2', you might also want to filter by mapping quality with awk '$3==0 && $4==$2 && $12>20' for minimum mapq 20 full length alignments.

colindaven commented 1 year ago

Thanks for that guidance, I'll look into the gaf format more thoroughly now.

Your suspicion was warranted - if my rust script is correct, then there were very few end-to-end alignments for these short reads, but MQ (set to 20) was pretty good:

**Results of gaf-filter-rs on SRR2183387_Arabidopsis_1-8_25k_R1.gaf**
* very few end to end aligns

GAF line count:15161
End to end count:85
MQ_pass_counter count:12837