Open ohdongha opened 2 years ago
Hi!
The coverage is calculated as # the total number of positions in all genomes covered by the alignment / total length of the genomes. The alignments blocks output by SibeliaZ do not overlap, at least they are not supposed to. You can add as many input genomes as you want. One more note on coverage: SibeliaZ filters out repeats with many copies, so sometimes the coverage can be less than expected. This behavior is parameter-dependent though.
Thanks, Ilia, for the quick response :) I have a few more questions regarding the "non-overlapping" part since this appears a requirement for maf2synteny.
The 1st alignment from the GRCh38-T2T comparison (attached below) had numerous (=180) sequences from mostly chromosome 1s from the two assemblies. Likewise, the alignments at the beginning of the MAF output contained many sequences aligned. But this number of sequences decreased and later alignments are mostly 1:1.
human_GRCh38_vs_T2T_firstBlock.maf.txt
I assume "non-overlapping" means that even if each alignment can be 1:multi, multi:1, or multi:multi, the same genomic region won't appear in more than one alignment. Is it correct?
The 1st alignment appears to be from repeats considering the many sequences in it. I used the default setting (-a 150) which would remove k-mers with frequency >150. Does this mean that repeat with copy number <150 can be still captured in the alignment while higher-copy repeats are more likely removed?
I read that MAF is "reference-based". Does SibeliaZ consider one of the genomes as the reference, e.g. the first one among the two (or many) given as arguments? If so, does the output MAF have a distinction about which is from the reference?
How does the maf2synteny deal with the sequences in 1:multi, multi:1, or multi:multi alignments when merging alignments to create longer synteny blocks? Will they be ignored or somehow some (or one?) of them be included in the synteny block?
Is there an easy way to find out the proportions of each genome that were aligned 1:1 or 1:multi / multi:1 / multi:multi from the MAF or GFF output?
Thanks again!
Hi @ohdongha ,
Sorry for the late reply.
Hope this helps!
Thanks a lot, Ilia.
I noticed that sometimes the first sequence of the SibeliaZ output maf was on - strand direction and some tools (e.g. phyloFit of PHAST) don’t like it. In case of Cactus, it seems to estimate an ancestral sequence and put it on top of all alignments (even for a pairwise comparison) (and in + direction) so that the ancestral sequence kind of serve as the reference. But these may be more about the MAF specification not very well defined/agreed upon?
I also noticed that some alignment blocks have sequences from only one species (more precisely, only one chromosome of a species). Does SibeliaZ print paralogous alignment as well?
I will also try ‘maf2synteny’ and the MAF-parser and report back! Dong-Ha
Hi, thanks for this fast and easy-to-run aligner :) I have two questions.
sibeliaz genome1.fa genome2.fa
but what if I want to align more than two genomes? Can I just keep addinggenome3.fa genome4.fa ...
? And in this case how the coverage would be calculated?Thanks a lot!