neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License

fix: mmseqs paf parsing and added tests #69

Closed mmolari closed 5 months ago

mmolari commented 5 months ago

I created some dummy sequences, aligned them with mmseqs, and used the output to add tests for the mmseqs PAF parsing.

I used the command:

mmseqs easy-search \
  qry.fasta \
  ref.fasta \
  results_mmseqs_rev.paf \
  tmp \
  --threads 1 \
  --max-seq-len 10000 \
  -a \
  --search-type 3 \
  --format-output "query,qlen,qstart,qend,empty,target,tlen,tstart,tend,nident,alnlen,bits,cigar,fident,raw"

Somewhat unexpectedly, the target sequence start and end seem to always be inverted in the output file (mmseqs2 version 15.6f452). I think this should be investigated further, but in the meantime I added the tests using the raw mmseqs output as test input, artificially inverting the ref-sequence start/end when they are inverted. This applies irrespective of whether the match is on the forward or reverse strand.
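For illustration, the workaround described above could look roughly like the Python sketch below. It is not the parser from this PR: the function name `parse_mmseqs_line` and the sample record are hypothetical, and the sketch assumes tab-separated output with columns in the order given to `--format-output`. Only the coordinate columns are converted to integers, and the target start/end are swapped whenever they arrive inverted, without trying to infer the strand from their order.

```python
# Column order taken from the --format-output string in the command above.
FIELDS = [
    "query", "qlen", "qstart", "qend", "empty",
    "target", "tlen", "tstart", "tend",
    "nident", "alnlen", "bits", "cigar", "fident", "raw",
]

# Only the coordinate columns are converted; the rest stay as strings.
INT_FIELDS = ("qlen", "qstart", "qend", "tlen", "tstart", "tend")


def parse_mmseqs_line(line: str) -> dict:
    """Parse one tab-separated mmseqs record (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} columns, got {len(values)}"
        )
    record = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    # Workaround: restore start <= end on the target if reported inverted.
    # This is applied regardless of strand, as described in the PR text.
    if record["tstart"] > record["tend"]:
        record["tstart"], record["tend"] = record["tend"], record["tstart"]
    return record


# Made-up record for demonstration only; not real mmseqs output.
sample = "qry_0\t500\t1\t100\t-\tref_0\t600\t300\t201\t98\t100\t150\t100M\t0.98\t140"
hit = parse_mmseqs_line(sample)
print(hit["tstart"], hit["tend"])
```

Because the swap is unconditional on strand, a sketch like this would need a separate source of strand information if the root cause of the inversion turns out to be strand-related.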