pangenome / impg

implicit pangenome graph
MIT License

alignments commands #2

Open colindaven opened 3 months ago

colindaven commented 3 months ago

Hi,

this is an interesting tool.

I'm not quite sure how to generate the alignments properly with either wfmash or minimap2.

Would it be something like this?

# align target vs each genome separately
minimap2 --eqx      ....

# concatenate the minimap2 PAFs
cat *.paf > combined.paf

# run impg on the whole paf
impg ... combined.paf

Or am I misunderstanding something here?

Thanks, Colin

ekg commented 2 months ago

It's designed to work with default wfmash output from an all-to-all alignment, e.g. wfmash seqs.fa.gz >aln.paf. It will probably work with minimap2, but that hasn't been tested.

AndreaGuarracino commented 2 months ago

I've made a few changes (#4) and now we can also parse minimap2's CIGAR strings:

minimap2 scerevisiae7.fa.gz scerevisiae7.fa.gz -X -c -t 48 > mm2.paf

impg -p mm2.paf -r UWOPS034614#1#chrI:1000-2000 | head -n 5 | column -t  
  UWOPS034614#1#chrI  1000    2000
  S288C#1#chrVIII     570385  569601
  DBVPG6765#1#chrVI   5195    5981
  DBVPG6765#1#chrI    210581  209798
  UWOPS034614#1#chrI  213064  212060

impg -p mm2.paf -r UWOPS034614#1#chrI:1000-2000 -P | head -n 5 | column -t 
  UWOPS034614#1#chrI  214332  1000    2000    +  UWOPS034614#1#chrI  214332  1000  2000  1000  1000  255  cg:Z:1000=
  S288C#1#chrVIII     581049  570385  569601  -  UWOPS034614#1#chrI  214332  1000  2000  775   1009  255  cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M2D1M3D23M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
  DBVPG6765#1#chrVI   257436  5195    5981    +  UWOPS034614#1#chrI  214332  1000  2000  777   1009  255  cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M2D1M1D25M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
  DBVPG6765#1#chrI    215496  210581  209798  -  UWOPS034614#1#chrI  214332  1000  2000  774   1009  255  cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M6D23M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
  UWOPS034614#1#chrI  214332  213064  212060  -  UWOPS034614#1#chrI  214332  1000  2000  1000  1004  255  cg:Z:171M4I829M

# --eqx writes =/X CIGAR operators (match/mismatch) instead of M
minimap2 scerevisiae7.fa.gz scerevisiae7.fa.gz -X -c -t 48 --eqx > mm2.eqx.paf

impg -p mm2.eqx.paf -r UWOPS034614#1#chrI:1000-2000 | head -n 5 | column -t 
  UWOPS034614#1#chrI  1000    2000
  S288C#1#chrVIII     570385  569601
  DBVPG6765#1#chrVI   5195    5981
  DBVPG6765#1#chrI    210581  209798
  UWOPS034614#1#chrI  213064  212060

impg -p mm2.eqx.paf -r UWOPS034614#1#chrI:1000-2000 -P | head -n 5 | column -t 
  UWOPS034614#1#chrI  214332  1000    2000    +  UWOPS034614#1#chrI  214332  1000  2000  1000  1000  255  cg:Z:1000=
  S288C#1#chrVIII     581049  570385  569601  -  UWOPS034614#1#chrI  214332  1000  2000  665   1009  255  cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=2D1=3D19=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
  DBVPG6765#1#chrVI   257436  5195    5981    +  UWOPS034614#1#chrI  214332  1000  2000  667   1009  255  cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=2D1=1D21=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
  DBVPG6765#1#chrI    215496  210581  209798  -  UWOPS034614#1#chrI  214332  1000  2000  664   1009  255  cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=6D19=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
  UWOPS034614#1#chrI  214332  213064  212060  -  UWOPS034614#1#chrI  214332  1000  2000  1000  1004  255  cg:Z:171=4I829=
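
As a sanity check on the -P output above, the CIGAR in the last column can be compared against the coordinates. This is only a sketch: check_paf is a hypothetical helper (not part of impg), and it assumes the -P lines follow the usual PAF column order (query start/end in columns 3-4, target start/end in columns 8-9). In a PAF CIGAR, M, = and X consume both sequences, I consumes only the query, and D consumes only the target.

```shell
# check_paf: hypothetical helper, not an impg command.
# Reads PAF lines on stdin and prints, per line:
#   query name, query bases in CIGAR, target bases in CIGAR, ok|MISMATCH
# Absolute differences are used because reverse-strand hits in the output
# above are printed with start > end.
check_paf() {
  awk '
    function abs(x) { return x < 0 ? -x : x }
    {
      # find the cg:Z: tag among the optional fields (column 13 onwards)
      cg = ""
      for (i = 13; i <= NF; i++) if ($i ~ /^cg:Z:/) cg = substr($i, 6)

      q = 0; t = 0; s = cg
      while (match(s, /^[0-9]+[MIDX=]/)) {
        n  = substr(s, 1, RLENGTH - 1) + 0   # run length
        op = substr(s, RLENGTH, 1)           # operator
        if (op != "D") q += n                # M, =, X, I consume the query
        if (op != "I") t += n                # M, =, X, D consume the target
        s = substr(s, RLENGTH + 1)
      }
      verdict = (q == abs($4 - $3) && t == abs($9 - $8)) ? "ok" : "MISMATCH"
      print $1, q, t, verdict
    }'
}

# Two of the lines shown above (whitespace-separated, as printed by column -t)
check_paf <<'EOF'
UWOPS034614#1#chrI 214332 213064 212060 - UWOPS034614#1#chrI 214332 1000 2000 1000 1004 255 cg:Z:171=4I829=
UWOPS034614#1#chrI 214332 1000 2000 + UWOPS034614#1#chrI 214332 1000 2000 1000 1000 255 cg:Z:1000=
EOF
```

For the first line, 171=4I829= covers 171+4+829 = 1004 query bases and 171+829 = 1000 target bases, which agrees with the 213064-212060 and 1000-2000 intervals. In practice the input would be piped in, e.g. impg -p mm2.eqx.paf -r UWOPS034614#1#chrI:1000-2000 -P | check_paf.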

colindaven commented 2 months ago

OK, thanks so much for the details; it is clear now.

I'll concatenate the FASTAs before alignment and test the new minimap2 support on some excessively large plant genomes.

I'll try wfmash as well.
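
A rough sketch of the concatenation step, under assumptions that are not stated in this thread: one FASTA per genome named <sample>.fa, headers that are bare contig names, and a haplotype id of 1. combine_fastas is a hypothetical helper. It prefixes every header so names stay unique after merging, using the same sample#1#contig pattern as the sequence names in the output above.

```shell
# combine_fastas: hypothetical helper.
# usage: combine_fastas OUT.fa IN1.fa [IN2.fa ...]
# Assumes each input is named <sample>.fa and that <sample> contains no
# "/" or "&" characters (they would break the sed expression).
combine_fastas() {
  out=$1; shift
  : > "$out"                                    # start from an empty file
  for fa in "$@"; do
    sample=$(basename "$fa" .fa)                # S288C.fa -> S288C
    sed "s/^>/>${sample}#1#/" "$fa" >> "$out"   # >chrI -> >S288C#1#chrI
  done
}

# Tiny demo with made-up sequences
tmp=$(mktemp -d)
printf '>chrI\nACGT\n' > "$tmp/S288C.fa"
printf '>chrI\nACGA\n' > "$tmp/DBVPG6765.fa"
combine_fastas "$tmp/combined.fa" "$tmp/S288C.fa" "$tmp/DBVPG6765.fa"
grep '^>' "$tmp/combined.fa"
rm -rf "$tmp"

# The combined file would then be aligned against itself, mirroring the
# commands quoted earlier in this thread:
#   minimap2 combined.fa combined.fa -X -c -t 48 > mm2.paf
#   wfmash combined.fa > aln.paf
```

Without the prefix, two genomes that both contain a contig called chrI would be indistinguishable in the PAF.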

ekg commented 2 months ago

aligners: We have developed wfmash to work well on plant genomes. A lot of testing has focused on comparing wfmash to other methods in highly divergent regions. Its performance is similar to AnchorWave's, but it does not depend on gene annotations. Let us know what works and what doesn't. We are preparing the publication now, after several years of refinement.

impg: Note also that you can use bgzip indexing with the PAF. I'll update the README to make that clearer.