ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Pangenome alignment #232

Open ekg opened 1 year ago

ekg commented 1 year ago

Is strobealign a good basis for the following application:

We would align against a pangenome FASTA. So we might have 100 genomes in this system. Downstream we project the alignments into a graph for processing with a very cheap approach.

This would benefit from fast indexing and from an absence of assumptions about the number of genome copies we have. Thoughts?

ksahlin commented 1 year ago

Your scenario is likely a good fit for strobealign. The paper includes a benchmark on a toy repetitive dataset that resembles what you describe. On that dataset, strobealign showed some of its best accuracy and runtime relative to other aligners (see attached figure; the stats are for v0.7.1). Version 0.8.0 may be a bit faster and slightly more accurate on 50-100 nt reads. As you said, indexing is also typically fast.

Characteristics of the REPEATS dataset (described in supplementary note A): We simulated a string of 100,000 nt by choosing letters A, C, G, and T at random. We produced 500 copies of this string, introducing SNPs at a 5% frequency and deleting segments of length between 1 nt and 1000 nt with probability 0.0001 on each copy. This roughly represents 500 copies of length 90-100 kbp at roughly 90% identity between copies, with some deletions of various sizes and locations. We then simulated reads from a genome related to this repetitive genome, using mason_variator with the parameters `--sv-indel-rate 0.00005 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50`.
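For readers who want to build a similar toy reference, here is a minimal Python sketch of the copy-generation step described above. It is not the script used for the paper. The function name `simulate_repeats` and its parameters are hypothetical, and it assumes that the 0.0001 deletion probability applies per position and that deletion lengths are uniform between 1 and 1000 nt; the description above does not pin down either detail.

```python
import random


def simulate_repeats(length=100_000, copies=500, snp_rate=0.05,
                     del_prob=0.0001, max_del=1000, seed=0):
    """Return (base_sequence, list_of_mutated_copies).

    Assumptions (not confirmed by the source): deletions start at each
    position with probability `del_prob`, and their lengths are drawn
    uniformly from 1..max_del.
    """
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Random base sequence over A, C, G, T
    base = [rng.choice(alphabet) for _ in range(length)]

    mutated = []
    for _ in range(copies):
        out = []
        i = 0
        while i < length:
            if rng.random() < del_prob:
                # Skip a deleted segment of random length
                i += rng.randint(1, max_del)
                continue
            c = base[i]
            if rng.random() < snp_rate:
                # Substitute with a different nucleotide
                c = rng.choice([b for b in alphabet if b != c])
            out.append(c)
            i += 1
        mutated.append("".join(out))
    return "".join(base), mutated


if __name__ == "__main__":
    # Small run for illustration; the full 500 x 100,000 nt setting
    # works too but is slow in pure Python.
    base, seqs = simulate_repeats(length=10_000, copies=5, seed=42)
    for idx, s in enumerate(seqs):
        print(f">copy_{idx} length={len(s)}")
```

Under these assumptions, two copies that each differ from the base at about 5% of positions differ from each other at roughly 10%, which matches the 90% identity figure quoted above.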

We would be very happy to get feedback if you decide to test strobealign for your scenario.

[Attached figure: screenshot of strobealign v0.7.1 accuracy and runtime relative to other aligners on the REPEATS dataset]

marcelm commented 1 year ago

Would that be a FASTA file with 100 concatenated genomes or something smaller with removed redundancies such as the nodes of a variation graph?

Not sure if this plays a role in your scenario, but strobealign uses lots of RAM, roughly 6 times the size of the input FASTA. Indexing speed is at least 20 Mbp/s (on my relatively old machine). We can parallelize some of this, so that can get faster if needed.
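To get a feel for what these numbers mean for a pangenome FASTA, here is a back-of-the-envelope calculation in Python. It is only a rough sketch: the function `estimate_resources` is hypothetical, the genome size of 3.1 Gbp is an assumed example value (the thread does not say which species is meant), and the FASTA size in bytes is approximated by the number of bases.

```python
def estimate_resources(total_bases, ram_factor=6.0, index_rate_bp_per_s=20e6):
    """Rough RAM (bytes) and indexing time (seconds) for a reference.

    ram_factor and index_rate_bp_per_s are the approximate figures
    quoted in this thread; FASTA size is taken as ~1 byte per base.
    """
    ram_bytes = total_bases * ram_factor
    seconds = total_bases / index_rate_bp_per_s
    return ram_bytes, seconds


if __name__ == "__main__":
    n_genomes = 100        # from the original question
    genome_size = 3.1e9    # assumed example: a human-sized genome
    ram, secs = estimate_resources(n_genomes * genome_size)
    print(f"RAM: about {ram / 1e12:.2f} TB")
    print(f"Indexing: up to about {secs / 3600:.1f} h")
```

Since the quoted indexing speed is a lower bound, the time estimate is an upper bound under these assumptions. A reference with redundancy removed would shrink both figures in proportion to its size.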

What output would you want for reads that map to multiple locations? Do you need all locations?

ghuls commented 7 months ago

I would be interested in something similar to this, but with 2 haplotypes per patient (where it would consider the best haplotype for each read, or both if they map equally well).