vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 191 forks source link

Giraffe how to find off-target sequences #4170

Open SZ-qing opened 7 months ago

SZ-qing commented 7 months ago

I have a sequence of 27bp length, and I want to find out the exact match of this sequence, and all the sequences that have 1 mismatch and 2 mismatch, by the method of graph.

I used bowtie1 to find this sequence in the linear HG38 genome with 1 sequence with one mismatch and 12 sequences with two mismatches. But I compared it in graph using giraffe and it came up with one paths, so I'm wondering if the graph reference genome can accomplish what I did in the linear genome?

linear HG38 results:
image

human pangenome:
command line is:
vg giraffe -Z ../hprc-v1.1-mc-grch38.gbz -m ../hprc-v1.1-mc-grch38_k16_w6.min -p -f ./input.fq --max-alignments 10000 -o gam --max-multimaps 10000 > ./output_k16_w6_aln_M10000.gam

graph file was downloaded from: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38

image

Thanks!!

jeizenga commented 7 months ago

I don't think there's currently any simple way to do what you're asking in VG, but it might be possible with enough fiddling. For a k-mer based algorithm to be guaranteed to find mappings with 2 mismatches in a sequence of length 27, you would need to set k = 9 (i.e. 27 / (2 + 1)). In the human genome, nearly all 9-mer matches will be spurious. Matches only start to become unique around ~15 bp. My guess is that in setting k = 9, you will also have to adjust many other parameters (especially the --hit-cap and --hard-hit-cap) to get vg giraffe to work well. Even then, I expect the mapping to be quite slow, because it will be bogged down in spurious alignments. That said, if you're interested in experimenting with it, it might be possible to troubleshoot a functional pipeline.