vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Hal to vg issues #3621

Open Zazzyre opened 2 years ago

Zazzyre commented 2 years ago

Hi, I'm trying to use a .vg file produced from a HAL file. I found the issue quoted below and tried to follow its commands, but vg explode and vg mod are deprecated and I'm getting a little lost:

Prepare the vg file

hal2vg --noAncestors --refGenome myref chr1.hal > tmp1.vg
vg mod -X 32 tmp1.vg > tmp1.chop32.vg
vg explode tmp1.chop32.vg exploded/
vg mod -D exploded/component0.vg > chr1.final.vg

Generate final gbwt index

vg index -T -G chr1.final.gbwt -F thread_names ./exploded/component0.vg

Generate final xg index

vg index -T -x chr1.final.xg -F thread_names chr1.final.vg

Generate pruned graph

vg prune -u -g chr1.final.gbwt -m node_mapping chr1.final.vg > chr1.final.prune.vg
vg index -g chr1.final.gcsa -f node_mapping -b ./tmp chr1.final.prune.vg


_Originally posted by @eldariont in https://github.com/vgteam/sv-genotyping-paper/issues/6#issuecomment-538982451_

adamnovak commented 2 years ago

Hello @Zazzyre!

The vg mod -X command is just there to break up the nodes so they are no larger than 32 bp, because vg is happiest when the nodes are manageably sized. You can dispense with it if you do:

hal2vg --chop 32 --noAncestors --refGenome myref chr1.hal > graph.vg

I think that might actually be all you really need to do; the rest of the pipeline that @eldariont gave in https://github.com/vgteam/sv-genotyping-paper/issues/6 is meant to throw away any tiny floating pieces of the graph that might exist, using vg explode (I think to work around hal2vg bugs that @glennhickey has fixed?), and then does a lot of work to pull out the paths from the HAL and turn them into GBWT threads to use for pruning the graph while building GCSA indexes for vg map. We've wrapped up a lot of index building intelligence into vg autoindex since then.

I think you probably can just take the graph straight from hal2vg, if you used --chop, and pass it to vg autoindex to build indexes for whichever mapper you want to use. If your graph is well-behaved enough it should Just Work.
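One way to confirm that chopping took effect is to look at the longest node sequence in the graph. The sketch below assumes the graph has been exported to GFA 1 (for example with vg convert -f, which comes up later in this thread) and that segment sequences are stored inline in the third column. The function name max_node_length and the file name graph.gfa are illustrative and not part of vg.

```shell
# Print the length of the longest segment (node) sequence in a GFA file.
# Assumes GFA 1 "S" lines of the form: S <name> <sequence> [tags...]
# Segments whose sequence is the placeholder "*" are skipped.
max_node_length() {
    awk -F '\t' '
        $1 == "S" && $3 != "*" { if (length($3) > max) max = length($3) }
        END { print max + 0 }
    ' "$1"
}
```

Calling max_node_length graph.gfa on a graph made with --chop 32 would be expected to print a number no larger than 32.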

If you run into trouble because the genome paths in the HAL really want to be interpreted as haplotypes for vg giraffe mapping and aren't, or because the graph is too complex for vg autoindex to build the GCSA and you need to fiddle with how it prunes and back-fills complex regions, then you might need to do some of this more complicated graph surgery.
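To see whether a graph has any of the small floating pieces mentioned above, one rough check is to count its connected components. This is only a sketch that counts components in a GFA 1 export using a union-find in awk; it does not split the graph the way vg explode did, and count_components is a hypothetical helper name, not a vg command.

```shell
# Count connected components in a GFA 1 file, treating links as undirected.
# A count above 1 means some segments are not attached to the main component.
count_components() {
    awk -F '\t' '
        # Register a segment name as a singleton set if it is unseen.
        function add(x) { if (!(x in parent)) parent[x] = x }
        # Find the set representative, shortening the path as we go.
        function find(x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]]
                x = parent[x]
            }
            return x
        }
        $1 == "S" { add($2) }
        $1 == "L" {
            add($2); add($4)
            a = find($2); b = find($4)
            if (a != b) parent[a] = b
        }
        END {
            n = 0
            for (x in parent) if (find(x) == x) n++
            print n
        }
    ' "$1"
}
```

If count_components graph.gfa prints 1, the graph is a single connected piece and the component-splitting step is probably unnecessary for it.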

Zazzyre commented 2 years ago

Awesome @adamnovak. I appreciate the reply.

With vg autoindex, what are the required inputs, or what is the syntax for giving it the .vg graph? It's giving me an insufficient input message:

Input is not sufficient to create indexes
Inputs GTF/GFF Reference FASTA are insufficient to create target index GCSA

jeizenga commented 2 years ago

vg autoindex expects more standard interchange formats like GFA, so the best move would probably be to convert the .vg file into a GFA first with vg convert -f.

Also, my guess is that you probably don't need to provide either the FASTA or the GTF/GFF file to autoindex. The FASTA input is intended to be used with a VCF file to define a graph. If you have the GFA file (which also defines a graph), there's no need to include the FASTA. The GTF/GFF file is only for adding splice junctions as graph edges. That's only beneficial if you plan to align RNA-seq data.
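Put together, the suggestion above might look like the following sketch. The file names and the helper names run and vg_to_indexes are placeholders. The map workflow is an assumption, chosen because the error message names GCSA as the target index; substitute whichever workflow matches the mapper you intend to use. Setting DRY_RUN=1 only prints the commands, which is handy for checking them before running anything expensive.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert a .vg graph to GFA, then build indexes from the GFA alone
# (no FASTA, VCF, or GTF/GFF inputs).
# Arguments: <graph.vg> <output prefix> [workflow, default "map"]
vg_to_indexes() {
    graph="$1"
    prefix="$2"
    workflow="${3:-map}"
    # vg convert -f writes the graph as GFA to standard output.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "vg convert -f $graph > $prefix.gfa"
    else
        vg convert -f "$graph" > "$prefix.gfa"
    fi
    run vg autoindex --workflow "$workflow" --gfa "$prefix.gfa" --prefix "$prefix"
}
```

For example, after setting DRY_RUN=1, calling vg_to_indexes graph.vg chr1 prints the two commands it would run; with DRY_RUN unset it runs them and writes the index files under the chr1 prefix.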