vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg grep the non-reference sequence from the Minigraph-Cactus result? #4193

Closed ld9866 closed 6 months ago

ld9866 commented 6 months ago

Dear developer: We currently use Minigraph-Cactus to build the pan-genome and convert it into vg format files. How can we extract the non-reference sequences in the pan-genome into FASTA format? I did this with vg before, but the code was lost due to our negligence. Can you help me?

jeizenga commented 6 months ago

You can extract paths as a FASTA using vg paths --extract-fasta. I think the interface requires a GBWT as input, so you may need to pull out the GBWT from your GBZ using vg gbwt.
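If the extracted FASTA ends up containing every path, one way to keep only the non-reference sequences is to filter the records by name afterwards. Here is a minimal sketch, assuming the path names follow the PanSN convention (sample#haplotype#contig) used by Minigraph-Cactus. The function name, the sample names, and the example records are hypothetical and are not taken from this thread.

```python
from typing import Iterable, Iterator


def drop_reference_records(lines: Iterable[str], ref_sample: str) -> Iterator[str]:
    """Yield FASTA lines whose record does not belong to ref_sample.

    Assumes headers look like '>sample#haplotype#contig' (PanSN); the
    sample is taken as everything before the first '#'.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The first whitespace-delimited token of the header is the path name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name.split("#", 1)[0] != ref_sample
        if keep:
            yield line


# Hypothetical input, standing in for the output of vg paths --extract-fasta.
example = [
    ">GRCh38#0#chr1\n",
    "ACGT\n",
    ">HG002#1#chr1\n",
    "ACGA\n",
    "TTGA\n",
]
non_ref = list(drop_reference_records(example, "GRCh38"))
```

In practice you would pass an open file handle instead of the example list and write the yielded lines to a new FASTA file. Note that this drops whole reference paths; it does not isolate the graph segments that are absent from the reference.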

ld9866 commented 6 months ago

Dear developer: We ran into a problem with vg index: it reports insufficient memory, even though our machine has 1 TB of RAM. Are our commands and overall approach correct, and how should we solve this problem?

    vg mod -X 256 test.full.vg > test.full.mod.vg
    vg index -x test.full.xg -g test.full.gcsa -k 16 -t 8 test.full.mod.vg
    error: InputGraph::InputGraph(): Memory use of input kmers (1149.82 GB) exceeds memory limit (1024 GB)
    vg gbwt -g test.full.gbwt -t 8 -x test.full.xg test.full.vg

jeizenga commented 6 months ago

Is there a reason you are doing a manual indexing pipeline instead of using vg autoindex? For most users, vg autoindex is more robust to issues like this.

It's also unclear to me which mapping tool you're planning to use. The GCSA2 index is used by vg map, but the GBWT usually is not. The GBWT is typically used in vg giraffe. This is another reason to use vg autoindex: it can determine exactly which indexes you need based on the mapping tool you want to use.
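To make the suggestion concrete, here is a rough sketch of how a vg autoindex call could be assembled for a chosen mapper. It assumes the graph is available as a GFA file (the name test.full.gfa is hypothetical, as is the test.full prefix) and uses the --workflow, --gfa, --prefix and --threads options; please check them against vg autoindex --help for your installed version. The helper only builds the command line and does not run vg.

```python
import shlex
from typing import List

# Workflow names for `vg autoindex --workflow`; this sketch only lists the
# mappers relevant to this discussion, so treat the tuple as an assumption.
KNOWN_WORKFLOWS = ("map", "mpmap", "giraffe")


def autoindex_command(workflow: str, gfa: str, prefix: str, threads: int = 8) -> List[str]:
    """Build (but do not run) a vg autoindex command for one mapping tool."""
    if workflow not in KNOWN_WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")
    return [
        "vg", "autoindex",
        "--workflow", workflow,   # lets vg pick the indexes this mapper needs
        "--gfa", gfa,             # hypothetical input graph file
        "--prefix", prefix,       # output files are named <prefix>.*
        "--threads", str(threads),
    ]


cmd = autoindex_command("giraffe", "test.full.gfa", "test.full")
command_line = shlex.join(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where vg is installed. Switching the workflow argument to "map" would request the indexes for vg map instead.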