medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/
Other
139 stars 19 forks source link

Q on coverage, multi-genome alignment, and maf2synteny #41

Open ohdongha opened 2 years ago

ohdongha commented 2 years ago

Hi, thanks for this fast and easy-to-run aligner :) I have two questions.

  1. How the "coverage" in the report was calculated? I had a coverage value 0.95 from a human GRCh38 to human T2T alignment run. Is the coverage = (total non-overlapping length covered by the alignment) * 2 / (length of genome 1 + length of genome 2)?
  2. The instruction goes sibeliaz genome1.fa genome2.fa but what if I want to align more than two genomes? Can I just keep adding genome3.fa genome4.fa ...? And in this case how the coverage would be calculated?

Thanks a lot!

iminkin commented 2 years ago

Hi!

The coverage is calculated as # the total number of positions in all genomes covered by the alignment / total length of the genomes. The alignments blocks output by SibeliaZ do not overlap, at least they are not supposed to. You can add as many input genomes as you want. One more note on coverage: SibeliaZ filters out repeats with many copies, so sometimes the coverage can be less than expected. This behavior is parameter-dependent though.

ohdongha commented 2 years ago

Thanks, Ilia, for the quick response :) I have a few more questions regarding the "non-overlapping" part since this appears a requirement for maf2synteny.

  1. The 1st alignment from the GRCh38-T2T comparison (attached below) had numerous (=180) sequences from mostly chromosome 1s from the two assemblies. Likewise, the alignments at the beginning of the MAF output contained many sequences aligned. But this number of sequences decreased and later alignments are mostly 1:1.
    human_GRCh38_vs_T2T_firstBlock.maf.txt I assume "non-overlapping" means that even if each alignment can be 1:multi, multi:1, or multi:multi, the same genomic region won't appear in more than one alignment. Is it correct?

  2. The 1st alignment appears to be from repeats considering the many sequences in it. I used the default setting (-a 150) which would remove k-mers with frequency >150. Does this mean that repeat with copy number <150 can be still captured in the alignment while higher-copy repeats are more likely removed?

  3. I read that MAF is "reference-based". Does SibeliaZ consider one of the genomes as the reference, e.g. the first one among the two (or many) given as arguments? If so, does the output MAF have a distinction about which is from the reference?

  4. How does the maf2synteny deal with the sequences in 1:multi, multi:1, or multi:multi alignments when merging alignments to create longer synteny blocks? Will they be ignored or somehow some (or one?) of them be included in the synteny block?

  5. Is there an easy way to find out the proportions of each genome that were aligned 1:1 or 1:multi / multi:1 / multi:multi from the MAF or GFF output?

Thanks again!

iminkin commented 2 years ago

Hi @ohdongha ,

Sorry for the late reply.

  1. Yes, each position in the genome is only supposed to be a part of a single alignment block.
  2. Yes, lower frequency repeats can be captured, depending on the parameter -a.
  3. MAF can be both reference or non-reference based. I am not sure, but it is likely that the notion of MAF being reference-based comes from the fact that earlier WGA tools like MultiZ were reference-based. SibeliaZ does not treat any particular input genome as a reference. Also, some other aligners (e.g. Cactus) also use MAF, but are not reference-based.
  4. I am not the author of maf2synteny, so I would refer this question to Mikhail Kolmogorov, who wrote the software. AFAIK, maf2synteny follows the general logic of Sibelia, where shorter blocks are gradually merged to form longer ones. This process does not have specific conditions on the multiplicity of the blocks. In other words, the resulting blocks could be either 1:multi, multi:1, or multi:multi, all depending on the actual data.
  5. I think the easiest way to compute it is to write a short Python script analyzing the data. It should be relatively straightforward. For example, BioPython has MAF-parser, which makes reading and operating MAF files very easy (https://biopython.org/docs/1.75/api/Bio.AlignIO.MafIO.html).

Hope this helps!

ohdongha commented 2 years ago

Thanks a lot, Ilia.
I noticed that sometimes the first sequence of the SibeliaZ output maf was on - strand direction and some tools (e.g. phyloFit of PHAST) don’t like it. In case of Cactus, it seems to estimate an ancestral sequence and put it on top of all alignments (even for a pairwise comparison) (and in + direction) so that the ancestral sequence kind of serve as the reference. But these may be more about the MAF specification not very well defined/agreed upon?

I also noticed that some alignment blocks have sequences from only one species (more precisely, only one chromosome of a species). Does SibeliaZ print paralogous alignment as well?

I will also try ‘maf2synteny’ and the MAF-parser and report back! Dong-Ha