glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Block gapped multi chromosome alignments and single character alignments #66

Open ChriKub opened 7 years ago

ChriKub commented 7 years ago

In the .maf file produced by aligning an assembly to its reference (same strain), I get two types of unwanted alignment blocks that look wrong. First, gapped multi-chromosome alignments, whose individual alignments would each make a nice alignment block on their own:

a
s       TAIR10.1        12837   15      +       30427671        GACCGGAGCTGCTGC
s       TAIR10.2        15151341        11      +       19698289        GT-TGTTG---CTGC
s       TAIR10.4        17945202        12      -       18585056        GGTCGGAA---CTGC
s       TAIR10.5        1893991 11      +       26975502        GG-CGGTG---TTGC
s       Col-0.000002F_pilon     9954396 11      +       14504382        GT-TGTTG---CTGC
s       Col-0.000003F_pilon     15907   15      -       13710986        GACCGGAGCTGCTGC
s       Col-0.000006F_pilon     1898033 11      -       11231471        GG-CGGTG---TTGC
s       Col-0.000009F_pilon     2250576 12      +       2911187 GGTCGGAA---CTGC
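One way to see that this block stacks several separate alignments is to group its rows by their aligned text. The following is only a minimal sketch: `parse_s_line` and `group_by_text` are hypothetical helper names, not part of Cactus or the HAL tools, and the code assumes standard whitespace-separated MAF `s` lines.

```python
from collections import defaultdict

def parse_s_line(line):
    """Split one MAF 's' line into its seven whitespace-separated fields."""
    _, src, start, size, strand, src_size, text = line.split()
    return {"src": src, "start": int(start), "size": int(size),
            "strand": strand, "src_size": int(src_size), "text": text}

def group_by_text(s_lines):
    """Map each distinct aligned string to the sequences that carry it."""
    groups = defaultdict(list)
    for line in s_lines:
        row = parse_s_line(line)
        groups[row["text"]].append(row["src"])
    return dict(groups)

# The 's' rows of the block quoted above, with whitespace normalised.
block = """\
s TAIR10.1 12837 15 + 30427671 GACCGGAGCTGCTGC
s TAIR10.2 15151341 11 + 19698289 GT-TGTTG---CTGC
s TAIR10.4 17945202 12 - 18585056 GGTCGGAA---CTGC
s TAIR10.5 1893991 11 + 26975502 GG-CGGTG---TTGC
s Col-0.000002F_pilon 9954396 11 + 14504382 GT-TGTTG---CTGC
s Col-0.000003F_pilon 15907 15 - 13710986 GACCGGAGCTGCTGC
s Col-0.000006F_pilon 1898033 11 - 11231471 GG-CGGTG---TTGC
s Col-0.000009F_pilon 2250576 12 + 2911187 GGTCGGAA---CTGC
""".splitlines()

groups = group_by_text(block)
for text, srcs in groups.items():
    print(text, srcs)
```

On this block the grouping yields four distinct strings, each shared by exactly one TAIR10 chromosome and one Col-0 contig.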

Second, large groups of single-character alignments spread over several chromosomes and positions:

a
s       TAIR10.1        12852   1       +       30427671        T
s       TAIR10.1        25884799        1       -       30427671        T
s       TAIR10.1        8982441 1       +       30427671        T
s       TAIR10.1        18778352        1       +       30427671        T
s       TAIR10.1        11489262        1       -       30427671        A
s       TAIR10.1        4546710 1       -       30427671        T
s       TAIR10.2        15151352        1       +       19698289        T
s       TAIR10.3        3651227 1       +       23459830        T
s       TAIR10.3        14062842        1       -       23459830        T
s       TAIR10.4        17945214        1       -       18585056        T
s       TAIR10.5        1894002 1       +       26975502        T
s       Col-0.000000F_pilon     4543388 1       +       15315658        T
s       Col-0.000000F_pilon     11481806        1       +       15315658        A
s       Col-0.000000F_pilon     3673796 1       -       15315658        T
s       Col-0.000002F_pilon     9954407 1       +       14504382        T
s       Col-0.000003F_pilon     9012953 1       -       13710986        T
s       Col-0.000003F_pilon     9164554 1       +       13710986        T
s       Col-0.000003F_pilon     15922   1       -       13710986        T
s       Col-0.000004F_pilon     3635083 1       +       13281416        T
s       Col-0.000004F_pilon     3899629 1       -       13281416        T
s       Col-0.000006F_pilon     1898044 1       -       11231471        T
s       Col-0.000009F_pilon     2250588 1       +       2911187 T
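Blocks like this one can at least be dropped after the fact. The sketch below is a post-processing step, not a Cactus or hal2maf option: the function names and the `min_cols` threshold are assumptions made for illustration, and only `a` and `s` lines are handled.

```python
def iter_maf_blocks(lines):
    """Yield each alignment block as a list of split lines ('a' line first)."""
    block = []
    for raw in lines:
        fields = raw.split()
        if not fields:                      # a blank line ends the current block
            if block:
                yield block
            block = []
        elif fields[0] == "a":              # an 'a' line starts a new block
            if block:
                yield block
            block = [fields]
        elif fields[0] == "s" and block:    # sequence row of the current block
            block.append(fields)
    if block:
        yield block

def block_width(block):
    """Number of alignment columns, taken from the first 's' row."""
    rows = [f for f in block if f[0] == "s"]
    return len(rows[0][6]) if rows else 0

def filter_short_blocks(lines, min_cols=2):
    """Keep only blocks with at least min_cols alignment columns."""
    return [b for b in iter_maf_blocks(lines) if block_width(b) >= min_cols]

# A few rows taken from the two blocks quoted in this issue.
example = [
    "a",
    "s TAIR10.1 12837 15 + 30427671 GACCGGAGCTGCTGC",
    "s Col-0.000003F_pilon 15907 15 - 13710986 GACCGGAGCTGCTGC",
    "",
    "a",
    "s TAIR10.1 12852 1 + 30427671 T",
    "s TAIR10.2 15151352 1 + 19698289 T",
    "s Col-0.000002F_pilon 9954407 1 + 14504382 T",
]

kept = filter_short_blocks(example)
```

Here `kept` holds only the 15-column block. Note that this discards the single-column blocks rather than explaining why they were produced.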

How can this behaviour be avoided? I ran cactus with default settings and hal2maf with:

hal2maf --maxBlockLen 35000000 --maxRefGap 35000000 --noAncestors --refGenome TAIR10 /TAIR10_Col-0.hal /TAIR10_Col-0.hal.maf

as I want to detect InDel events of all sizes from assemblies.

memory-donk commented 6 years ago

I think I have a somewhat related problem. I'm trying to align the genomes of three Drosophila species downloaded from UCSC against each other, and the vast majority of alignments are between scaffolds within the same assembly rather than between scaffolds of different species. I can understand that there may be some spurious, redundant contigs/scaffolds within some of the assemblies, but the output seems heavily slanted toward within-genome alignments rather than between-genome alignments. Do you have any explanation for this behavior?

Edit: I think I realized what I did wrong. If I understand things correctly, these are indeed likely to be paralogous segments. Using the --noDupes option in hal2maf.py removes these duplicate segments. That said, outputting these duplicates is very useful because it helps identify which alignments may be too difficult to resolve into clear 1-to-1 orthology between species. What I thought was an issue turned out to be a very useful feature of the aligner. Thanks!
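To find which blocks contain such duplicates without discarding them, one can count rows per genome in each block. This is a rough sketch that does not reproduce how --noDupes chooses rows. It assumes the source field follows a "genome.sequence" naming pattern, as in the blocks quoted earlier in this thread, and all function names are hypothetical.

```python
from collections import Counter

def genome_of(src):
    """Genome name, assumed to be the part of the source field before the first '.'."""
    return src.split(".", 1)[0]

def rows_per_genome(srcs):
    """Count how many rows of a block belong to each genome."""
    return Counter(genome_of(s) for s in srcs)

def has_duplicates(srcs):
    """True if any genome contributes more than one row to the block."""
    return any(n > 1 for n in rows_per_genome(srcs).values())

# Source names from the first block in this issue (eight rows, two genomes).
multi = ["TAIR10.1", "TAIR10.2", "TAIR10.4", "TAIR10.5",
         "Col-0.000002F_pilon", "Col-0.000003F_pilon",
         "Col-0.000006F_pilon", "Col-0.000009F_pilon"]
# A block with exactly one row per genome.
single = ["TAIR10.1", "Col-0.000003F_pilon"]

print(rows_per_genome(multi))
print(has_duplicates(multi), has_duplicates(single))
```

Blocks flagged this way are the ones where 1-to-1 orthology is unclear, which is the information the duplicate rows carry.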