hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License

All to all alignment for graphical pangenomics #7

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Is it possible to use GSAlign to align multiple genomes against each other, for example by concatenating the genomes and passing the same file as both the reference and the query? I know your paper rules out comparison with multiple sequence alignment tools such as Cactus; I was just wondering whether, in certain circumstances like my example below, GSAlign could be used this way by aligning all genomes to all genomes. I know the VCF file shows only the reference sequence name, not the query sequence. If the code could be updated to report which query sequence each variant call came from, this could be a viable way to identify variants across a pangenome quickly and easily.

For example:

# first rename all FASTA headers so that they uniquely identify each variety
cat *.fasta > all.fasta
GSAlign -t 16 -sen -dp -gp /usr/bin/gnuplot -idy 90 -one -r all.fasta -q all.fasta -o pangenome
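
The header-renaming step could be done along these lines. This is a minimal sketch and not part of GSAlign: the function names `prefix_headers` and `combine_fastas`, the `#` separator, and the use of each file's base name as the variety label are all assumptions made for illustration.

```shell
# Hypothetical helper: prefix every FASTA header with a variety label,
# so ">chr1H" read with label "Morex" becomes ">Morex#chr1H".
# Reads FASTA on stdin, writes FASTA on stdout.
prefix_headers() {
  awk -v p="$1" '/^>/ { print ">" p "#" substr($0, 2); next } { print }'
}

# Hypothetical helper: concatenate several FASTA files into one,
# labelling each sequence with the base name of the file it came from.
# Usage: combine_fastas OUTPUT INPUT...
combine_fastas() {
  out="$1"
  shift
  : > "$out"                          # create or truncate the output file
  for f in "$@"; do
    [ -e "$f" ] || continue           # skip an unmatched glob pattern
    name=$(basename "$f")
    prefix_headers "${name%.*}" < "$f" >> "$out"
  done
}
```

For example, `mkdir -p combined && combine_fastas combined/all.fasta *.fasta` writes the merged file into a separate directory, so it is not picked up by the `*.fasta` glob on a later run. The merged file can then be passed to both `-r` and `-q`.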

I'm aiming to align 20 different varieties of barley and then pull the alignments into a genome graph using VG (https://github.com/vgteam/vg) to visualize the variation. The output VCF from GSAlign could be used as input to VG for this.

Thanks.

hsinnan75 commented 3 years ago

Hi, it is doable, but I need to modify the code. I'll let you know when the code is updated. Thank you for the suggestion.

yassineS commented 3 years ago

@brettChapman you might want to check out https://github.com/pangenome/pggb.

Alternatively, you can use GSAlign, convert the output alignment to PAF format (check out paftools.js), and then convert the alignment to a graph using seqwish.