medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/
Other
140 stars 19 forks source link

Validation in Bacteria #19

Open noahaus opened 3 years ago

noahaus commented 3 years ago

I feel like this type of method would work as a fantastic competitor to bacterial whole genome alignment software such as Mauv/progressiveMauve. Have you tried validating this software with bacterial datasets at all? looking to use a program like this for my own research trying to analyze bacterial genomes

iminkin commented 3 years ago

Hi @noahaus,

We only validated SibeliaZ on a simulated bacterial dataset and a real mammalian one. We did not rigorously validate on a real bacterial dataset since our resources are limited. My guess would be that SibeliaZ should work fine on bacterial datasets since their organization is not as complex. However, if you would like to align hundreds of genomes at once, you will probably have to adjust the parameters slightly, especially -a or "abundance" (see documentation),

sapoudel commented 3 years ago

Hey @iminkin I am also trying to use this with bacterial genomes. I am currently trying to align 2000 S. aureus genomes (2.8 kb each) from same clonal complex i.e. they should have very little genomic differences. I am trying to understand how the memory required scales with size of the genome vs. number of genomes aligned.

nick-youngblut commented 2 years ago

I'm also using SibeliaZ on very similar bacterial genomes. As a simple test, I tried aligning a genome to itself, and so I expected to see 2 LCBs (the genome assembly contains 2 contigs), instead 286 blocks were found:

$ sibeliaz -t 8 -o aln_out_TEST Lr-64M-h.fna Lr-64M-h.fna
Constructing the graph...
Threads = 8
Vertex length = 25
Hash functions = 5
Filter size = 17179869184
Capacity = 1
Files:
Lr-64M-h.fna
Lr-64M-h.fna
--------------------------------------------------------------------------------
Round 0, 0:17179869184
Pass    Filling Filtering
1   5   1
2   0   0
True junctions count = 1326
False junctions count = 0
Hash table size = 1326
Candidate marks count = 10864
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 10864
Edges construction time: 0
--------------------------------------------------------------------------------
Distinct junctions = 1326

Loading the graph...
Analyzing the graph...
[....................................................]
Generating the output...
Blocks found: 286
Coverage: 1.00
Performing global alignment..
noahaus commented 2 years ago

I'm also using SibeliaZ on very similar bacterial genomes. As a simple test, I tried aligning a genome to itself, and so I expected to see 2 LCBs (the genome assembly contains 2 contigs), instead 286 blocks were found:

@nick-youngblut That's pretty interesting to see that. If you dug a bit deeper, I wonder if the LCBs that are created are far apart from each other or relatively close. Maybe SibelaZ is doing something to make sure it doesn't fully collapse all these regions into 1 LCB.

@iminkin do you have any idea what would cause the algorithm to create more LCBs in this situation when the alignment is theoretically perfect?

nick-youngblut commented 2 years ago

The LCBs seem to be relatively close

noahaus commented 2 years ago

Now, I'm no expert on this particular software, but I remember doing some preliminary reading of the nature paper and this might be something to do with the variation graph path-building step. I'm sure there might be a parameter that will get you the two LCBs you are looking for. But I'd still perhaps wait for an official answer from the creator.

I looked at the documentation and noticed that you didn't specify the k parameter. For bacteria, this is recommended to be 15, but the default is geared more towards mammalian genome size?

I take it this isn't going to be a major focus, but it would be interesting to do some benchmarking for your example. What combination of parameters gets us closer to 2 like you expect. If you run again and change -k I have a feeling you'd get a more realistic number of LCBs. Good luck!

nick-youngblut commented 2 years ago

I've switched over to minimap2, which in my case, provides everything I need

iminkin commented 2 years ago

Hi,

One possible reason for this behavior are repeats in the genome. SibeliaZ tries to find all conserved blocks and evolutionary relationships between them to the fullest extent. Sometimes it results in a fragmented representation with many blocks. Let's consider an example, here are two genomes represented as sequences of LCBs, each number is an LCB:

Genome 1: +1 +2 +1 +3 Genome 2: +1 +2 +1 +3

Let's assume that +1 is a kind of a small repeat, like a transposable element or so. Then SibeliaZ will report three LCBs despite the fact that the genomes are nearly the same. If this fragmentation is caused by small elements, one way to make this representation more coarse-grained is to construct synteny blocks afterward. Here is an instruction on how.

If I understand correctly, minimap2 aligns genomes without caring too much about the repeat structure inside each genome as long the genomes are collinear with each other. Both representations are valid and could be more (or less) useful depending on the exact context of the downstream analysis.

Hope this clarifies this case.

jianshu93 commented 6 months ago

Hi all, Just want to add a followup question, Does SibeliaZ also work for slightly distantly related genomes, e.g., average nucleotide identity is above 92% for example (human genomes are always > 99.9%). I noticed the parsnp only works for >97% or so average nucleotide identity. Mauve is just too slow to align only ~30 bacterial genomes.

Thank you,

Jianshu

iminkin commented 6 months ago

Hi @jianshu93,

Yes, SibeliaZ should be fine in this case. Moreover, we did an experiment where we simulated bacterial datasets with different degrees of divergence, and alignment recall was > 95% for the substitution rate of 9%, where the rate was measured between the root and the leaf genomes in the phylogenetic tree. Further details can be found in the section "Effect of parameters and sequence divergence" of our paper (https://www.nature.com/articles/s41467-020-19777-8#MOESM1) and Supplementary Figure 6 (https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-020-19777-8/MediaObjects/41467_2020_19777_MOESM1_ESM.pdf).