Open ChriKub opened 7 years ago
I think I have a related problem. I'm trying to align the genomes of three Drosophila species, downloaded from UCSC, against each other, and the vast majority of alignments are between scaffolds within the same assembly rather than between scaffolds of different species. I can understand that some of the assemblies may contain spurious, redundant contigs/scaffolds, but the output seems heavily slanted toward within-genome alignments rather than between-genome alignments. Do you have any explanation for this behavior?
Edit: I think I realized what I did wrong. If I understand things correctly, these are indeed likely to be paralogous segments. Using the --noDupes option in hal2maf.py removes these duplicate segments. That said, outputting the duplicates is very useful, because it helps identify regions where clear 1-to-1 orthology between species may be too difficult to establish. What I thought was an issue turned out to be a very useful feature of the aligner. Thanks!
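In case it helps anyone else, here is a minimal sketch of how one might count how many MAF blocks contain the same genome more than once (i.e. candidate paralogs) before deciding whether to use --noDupes. It assumes the sequence names in the MAF follow the `genome.sequence` convention; the function names `iter_maf_blocks` and `count_duplicated_blocks` are my own and not part of any HAL tool.

```python
from collections import Counter


def iter_maf_blocks(lines):
    """Yield each MAF block as a list of source names taken from its 's' lines."""
    block = []
    for raw in lines:
        fields = raw.split()
        if fields and fields[0] == "s":
            # MAF 's' line: s src start size strand srcSize text
            block.append(fields[1])
        elif block and (not fields or fields[0] == "a"):
            # A blank line or a new 'a' line closes the current block.
            yield block
            block = []
    if block:
        yield block


def count_duplicated_blocks(lines):
    """Return (total_blocks, blocks_where_some_genome_appears_more_than_once).

    Assumption: source names look like 'genome.sequence', so the genome
    is everything before the first dot.
    """
    total = 0
    duplicated = 0
    for block in iter_maf_blocks(lines):
        total += 1
        per_genome = Counter(src.split(".", 1)[0] for src in block)
        if any(n > 1 for n in per_genome.values()):
            duplicated += 1
    return total, duplicated
```

Called on an open file handle, e.g. `count_duplicated_blocks(open("out.maf"))`, this gives a rough idea of how much of the output is within-genome alignment.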
In the .maf file resulting from aligning an assembly to its reference (same strain), I get two types of unwanted alignment blocks that seem wrong. First, gapped multi-chromosome alignments, each of which would be a nice alignment block on its own:
Second, large groups of single-character alignments spread over several chromosomes and positions:
How can this behaviour be avoided? I ran cactus on default settings and hal2maf with:
hal2maf --maxBlockLen 35000000 --maxRefGap 35000000 --noAncestors --refGenome TAIR10 /TAIR10_Col-0.hal /TAIR10_Col-0.hal.maf
as I want to detect InDel events of all sizes from assemblies.
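As a stopgap while the cause is unclear, the single-character blocks could be dropped in post-processing. The following is only a sketch of that idea, not something hal2maf does itself: it keeps blocks whose alignment text has at least `min_cols` columns. The names `read_blocks` and `drop_short_blocks` and the default threshold are hypothetical choices for illustration.

```python
def read_blocks(lines):
    """Yield each MAF block as a list of (src, text) pairs from its 's' lines."""
    block = []
    for raw in lines:
        fields = raw.split()
        if fields and fields[0] == "s":
            # MAF 's' line: s src start size strand srcSize text
            block.append((fields[1], fields[6]))
        elif block and (not fields or fields[0] == "a"):
            # A blank line or a new 'a' line closes the current block.
            yield block
            block = []
    if block:
        yield block


def drop_short_blocks(lines, min_cols=2):
    """Keep only blocks with at least min_cols alignment columns.

    All rows of a MAF block have the same text length (gaps included),
    so the first row is enough to measure the block width.
    """
    return [b for b in read_blocks(lines) if len(b[0][1]) >= min_cols]
```

Note that this filters on block width only; it does nothing about the gapped multi-chromosome blocks, and a very small `min_cols` keeps short indel-flanking blocks intact.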