glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Handling masked characters #110

Closed iminkin closed 6 years ago

iminkin commented 6 years ago

Hi,

How does cactus handle masked characters in the input? I assume it ignores them, but the resulting maf contains masked bases.

joelarmstrong commented 6 years ago

Hi Ilya,

You're right that the soft-masked bases are ignored for the initial local alignment process (generating potential "anchors") for efficiency reasons. But we still end up aligning most repetitive regions, because we extend out from the initial anchors using a more sensitive alignment algorithm. This extension step can align even masked bases, because the problems it has to deal with are much smaller (thousands of kb instead of millions).
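As a rough illustration of the anchoring idea described above, here is a minimal Python sketch. It is not Cactus's actual code: the function names `unmasked_kmers` and `find_anchors`, the use of exact k-mer matches as anchors, and the value of `k` are all assumptions made for the example. It only shows how soft-masked (lowercase) bases can be skipped when looking for candidate anchors.

```python
def unmasked_kmers(seq, k):
    """Yield (position, kmer) for every k-mer with no soft-masked (lowercase) base."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # str.isupper() is False as soon as one lowercase letter is present.
        if kmer.isupper():
            yield i, kmer


def find_anchors(a, b, k=5):
    """Toy anchor finder (hypothetical): exact unmasked k-mers unique in both sequences.

    Returns a sorted list of (position_in_a, position_in_b) pairs.
    """
    index_a, index_b = {}, {}
    for i, kmer in unmasked_kmers(a, k):
        index_a.setdefault(kmer, []).append(i)
    for j, kmer in unmasked_kmers(b, k):
        index_b.setdefault(kmer, []).append(j)

    anchors = []
    for kmer, positions_a in index_a.items():
        positions_b = index_b.get(kmer, [])
        # Keep only k-mers that occur exactly once on each side.
        if len(positions_a) == 1 and len(positions_b) == 1:
            anchors.append((positions_a[0], positions_b[0]))
    return sorted(anchors)


# Two toy sequences: unmasked flanks around a soft-masked repeat.
seq_a = "ACGTG" + "acacaca" + "TTGCA"
seq_b = "ACGTG" + "acagaca" + "TTGCA"
print(find_anchors(seq_a, seq_b, k=5))
```

In this toy input the only anchors found are the two unmasked flanks; every k-mer touching the lowercase repeat is ignored at this stage.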

So, for example, a ~6000bp LINE1 element shared between human and chimp won't be aligned during the initial alignment process. But, assuming we can find anchors on either side of that element, we can align the LINE1 between human and chimp just fine, because we construct another, simpler alignment problem using just that syntenic region.
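The "anchors on either side" idea can be sketched the same way. The following is a toy example under stated assumptions, not the algorithm Cactus uses: it takes two flanking anchors as given, cuts out the region between them, and aligns that small region with a plain Needleman-Wunsch global alignment that ignores case, so masked bases are aligned like any others. The names `global_align` and `align_between_anchors` and the scoring values are hypothetical.

```python
def global_align(x, y, match=1, mismatch=-1, gap=-2):
    """Case-insensitive Needleman-Wunsch; returns the two gapped strings."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1].upper() == y[j - 1].upper() else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_x, out_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1].upper() == y[j - 1].upper() else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_x.append(x[i - 1])
                out_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_x.append(x[i - 1])
            out_y.append("-")
            i -= 1
        else:
            out_x.append("-")
            out_y.append(y[j - 1])
            j -= 1
    return "".join(reversed(out_x)), "".join(reversed(out_y))


def align_between_anchors(a, b, left, right, k):
    """Align only the region between two anchors (hypothetical helper).

    `left` and `right` are (pos_in_a, pos_in_b) start positions of k-long anchors.
    """
    region_a = a[left[0] + k:right[0]]
    region_b = b[left[1] + k:right[1]]
    return global_align(region_a, region_b)


seq_a = "ACGTG" + "acacaca" + "TTGCA"
seq_b = "ACGTG" + "acagaca" + "TTGCA"
print(align_between_anchors(seq_a, seq_b, left=(0, 0), right=(12, 12), k=5))
```

Because the subproblem is restricted to the region between the anchors, a quadratic-time aligner is affordable here even though it would not be for whole genomes.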

Basically, what this ends up meaning is that repetitive elements do end up getting aligned, except in cases where the element has been structurally rearranged relative to its sibling genome, or in very very long stretches (100s of kb) of purely repetitive elements.