vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

A how-to tutorial for vg giraffe #3303

Closed brettChapman closed 3 years ago

brettChapman commented 3 years ago

Hi

I have a pangenome graph for each chromosome, spanning 20 different genomes, which I produced using PGGB.

I also have high and low coverage genomic short reads from many other different cultivars.

I would like to align these reads to the pangenome graphs (a graph for each chromosome) to generate genotypes for each of the other cultivars.

I've seen this tutorial online for the MHC (https://gtpb.github.io/CPANG19/pages/mhc) explaining how to do this using vg map. Would it be very similar using vg giraffe? Are there any parameters with vg giraffe I should consider, or is it relatively safe to stick with the defaults? I had a look at vg giraffe's parameter list and there are quite a few. It would be great if a tutorial for vg giraffe could be put up on the Wiki (I haven't come across one if there is one).

Thank you for any help you can provide.

brettChapman commented 3 years ago

Can vg giraffe take multiple paired reads? Or do I need to produce multiple GAM files? If so, does vg augment accept multiple GAM files prior to running vg call? Thanks.

userzxyz commented 3 years ago

Thank you for opening this issue. I am at the same point, trying to figure things out for vg giraffe, and could not find any tutorial similar to the one available for vg map. It would be helpful to have such a tutorial for reference.

adamnovak commented 3 years ago

We would like to have a tutorial for Giraffe, but I think you're right that there isn't one yet. We're still working on vg autoindex, which is supposed to let you set up your files for Giraffe without having to go get toil-vg.

Giraffe isn't going to take a graph for each chromosome; all the graphs need to be combined into a single file, with the node ids adjusted if necessary so they don't conflict.

I'm not quite sure what the right way is to get your PGGB graphs (with embedded paths for the different input samples?) into the GBWT of haplotypes that Giraffe needs for alignment. @jeizenga does vg autoindex support that flow yet?

The default settings for Giraffe have worked OK for us on short reads against a graph that's mostly SNPs and a graph that's mostly big SVs, so we think they are fairly robust.

Giraffe is meant to align a single sample/read group at a time. I don't think it can do any sort of read group tagging or route output to multiple files, and it won't handle different paired-end samples with different fragment length distributions in the same run. But you can feed paired reads in as paired or interleaved FASTQ, and you will always get out an interleaved GAM that the calling pipeline can use.
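
For reference, the two input styles would look something like this (file names here are placeholders, and repeating -f for the two mate files is an assumption based on how other vg mappers take pairs):

# separate mate files (assumed: -f may be given twice, once per end)
vg giraffe -x graph.xg -H graph.gbwt -f reads_1.fq -f reads_2.fq >mapped.gam
# one interleaved file holding both ends of each pair
vg giraffe -x graph.xg -H graph.gbwt -f reads.fq --interleaved >mapped.gam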

brettChapman commented 3 years ago

Thanks @adamnovak

How would I go about merging my chromosome graphs to avoid conflict with node ids?

Since vg giraffe won't take multiple FASTQ files or GAM files, I think merging all the FASTQ files for a particular sample into one file (interleaved) or two (paired) should work prior to aligning, for example as below.
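
Something like this, with hypothetical run names (concatenated gzip streams are still valid gzip):

# merge per-run mate-1 and mate-2 files for one sample
cat sample_run1_1.fq.gz sample_run2_1.fq.gz > sample_1.fq.gz
cat sample_run1_2.fq.gz sample_run2_2.fq.gz > sample_2.fq.gz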

What exactly does autoindex do? Does it simply prepare files for mapping or is it part of an entire genotyping pipeline?

I had a look at the usage of autoindex. It looks like it can prepare files for giraffe. There's an option for adding features from a GFF. Is that much like vg rna, where you can add exons and transcript IDs? I assume with these added, any output VCF would be annotated with UTR, missense, nonsense variants, etc., much like SnpEff does with a VCF when given a GFF and genome sequence?

brettChapman commented 3 years ago

In regards to joining graphs, I found an old issue which I posted a while back: https://github.com/vgteam/vg/issues/2899

vg ids -j followed by vg concat should do the trick. Hopefully I have enough memory to join all these graphs.

brettChapman commented 3 years ago

I just had a look at the vg combine usage, and it appears it resolves node id conflicts automatically.

singularity exec --bind ${PWD}:${PWD} /data/vg_builds/vg.sif vg combine
usage: /vg/bin/vg combine [options] <graph1.vg> [graph2.vg ...] >merged.vg
Combines one or more graphs into a single file, regardless of input format.
Node IDs will be modified as needed to resolve conflicts (in same manner as vg ids -j).

Options:
    -c, --cat-proto       Merge graphs by converting each to Protobuf (if not already) and catting the results.
                          Node IDs not modified [DEPRECATED]
    -p, --connect-paths   Add edges necessary to connect paths with the same name present in different graphs.
                          ex: If path x is present in graphs N-1 and N, then an edge connecting the last node of x in N-1 
                          and the first node of x in N will be added.

Is it safe to simply use vg combine on each of my chromosome graphs to generate a merged graph without any prior preparation with vg ids? Should I use -p and -c during the merging?

I also have an augmented version of my graphs with embedded gene, exon, and CDS regions from a reference sequence. Would using the augmented version of the graph prove problematic for read mapping with Giraffe, or should I use the un-augmented version for mapping, especially if I run vg autoindex with the -f, -x and -a options, which I assume add exons and transcript_ids to the graph prior to mapping? Thanks.

glennhickey commented 3 years ago

It's usually best to keep big graphs in xg format, and vg ids -j followed by vg index -x, as described in this old wiki, is a pretty efficient way to do it.

If you want your whole-genome graph in a format that can be modified (i.e. vg instead of xg), then vg combine on its own will work fine, but memory usage will probably be much higher than with xg, so it's better to modify your chrom graphs individually, then merge into a read-only xg. A sketch of that flow is below.
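
For example, assuming one .vg per chromosome (file names are placeholders):

# rewrite node ids in place so all the graphs share one id space
vg ids -j chr1.vg chr2.vg chr3.vg chr4.vg chr5.vg chr6.vg chr7.vg
# then build a single read-only xg over all of them
vg index -x whole_genome.xg chr1.vg chr2.vg chr3.vg chr4.vg chr5.vg chr6.vg chr7.vg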

brettChapman commented 3 years ago

Thanks @glennhickey

Does vg ids -j take a packed graph, or only .vg? I was thinking that, to reduce memory usage, a packed graph might be better. I really need to reduce the memory overhead as much as possible to avoid hitting the memory limit.

jeizenga commented 3 years ago

It should take a packed graph just fine.

brettChapman commented 3 years ago

Hi, I've looked through the Giraffe wiki https://github.com/vgteam/vg/wiki/Mapping-short-reads-with-Giraffe

There doesn't seem to be an option for autoindex to take an indexed graph.

I ran vg ids -j, then vg index -x for each of my graphs to generate one single large pangenome graph in .xg format.

From this point on I'm a little lost, as vg autoindex seems to only take a GFA file. However, vg giraffe takes an .xg file, but then I don't have any of the other inputs such as the minimizer index, distance index, etc. as outlined in the Wiki, which I assume vg autoindex generates for input into vg giraffe.

My only inputs at this point are my merged pangenome graph (.xg) and my FASTQ files. I have GFF files for each of the genomes in my pangenome that I can add as references for transcript IDs and exon regions.

Are there plans for vg autoindex to take an indexed graph (.xg) as well? Thanks.

jeizenga commented 3 years ago

Giraffe uses the GBWT (.gbwt) and GBWTGraph (.gg) in place of an XG. vg autoindex will create both of those along with the minimizer index and distance index when you run it with the --workflow giraffe option. autoindex is designed to take common interchange formats as input. For the graph topology, that will generally be either a GFA or a paired FASTA and VCF. Presumably you used one of these inputs to create the XG, right?
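
For reference, either input style would look something like this (file names are placeholders):

# from a GFA
vg autoindex --workflow giraffe -g graph.gfa -p graph
# or from a reference FASTA plus a phased VCF
vg autoindex --workflow giraffe -r ref.fa -v variants.vcf.gz -p graph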

brettChapman commented 3 years ago

The XG was generated from multiple .vg inputs (7 chromosomes), which I then indexed all together to create one single XG.

I'm now thinking that instead of generating a single XG, I should have run vg combine to generate a single VG file, which I could then convert to GFA using vg view -g. My only concern, as pointed out by @glennhickey, is that using anything other than XG will come at the cost of high memory usage, which may cause issues. Each node on my cluster has a maximum of 126GB, although I've increased swap space to 2TB, so it may be OK, but at the cost of runtime.

jeizenga commented 3 years ago

If you have multiple FASTAs/VCFs (e.g. for separate chromosomes), you can provide them all to autoindex by repeating the -r/-v options. autoindex will automatically try to respect the available memory on the local machine, but if you still have issues with it using too much memory, you can encourage it to have greater memory thrift with -M.
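
For example (file names and the memory figure are placeholders, and the exact format of the -M value is an assumption):

vg autoindex --workflow giraffe -p graph \
    -r chr1.fa -v chr1.vcf.gz \
    -r chr2.fa -v chr2.vcf.gz \
    -M 100G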

adamnovak commented 3 years ago

I don't think vgs -> one big vg -> GFA -> autoindex is really the right path. And I think we now load all our vg files internally as PackedGraph, so they should be pretty memory-efficient just to open @glennhickey.

If you want to try and re-use some existing files with autoindex, you can use the secret menu of autoindex where you tell it -P "index type:filename,filename,filename,..." and it will load the files you give it instead of generating those intermediates itself. But if you do that, you need to know a bit about autoindex and the recipes it uses in order to still get a correct result.

The types of indexes it knows about are defined here: https://github.com/vgteam/vg/blob/271c428deca14695386fa0ede61ba52f48ff98ac/src/index_registry.cpp#L330
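
For instance, something along these lines (the index type strings here are made up for illustration; the real names are the ones registered in the file linked above):

# hypothetical: hand autoindex a pre-built XG and GBWT so it skips those recipes
vg autoindex --workflow giraffe -p graph \
    -P "XG:graph.xg" \
    -P "Giraffe GBWT:graph.gbwt"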

But if you just provide it the XG index, and no VCFs or vg files with embedded alt allele paths (-a from construct), it won't use real haplotype information, and it won't fall back on the named paths in the graph for haplotypes either. It will just generate haplotypes to greedily cover all the variation, which might not be what you want.

PGGB leaves you without any VCFs, but it ought to cover the input haplotypes with their own named paths.

If your XG has embedded paths that cover all the haplotypes you want to align to (which would show up in vg paths -L -x graph.xg), then you can manually build a GBWT that just describes those paths:

vg gbwt -x graph.xg --index-paths -o graph.gbwt

Giraffe knows how to build missing indexes, so you should be able to feed it just the XG and the GBWT and let it build everything else it needs from those before it runs:

vg giraffe -x graph.xg -H graph.gbwt -f paired-reads.fq --interleaved --progress >mapped.gam
brettChapman commented 3 years ago

Thanks @adamnovak

My graph contains all the haplotypes I want to align to, so I think using vg gbwt followed by vg giraffe would be the way to go. I'll update again if I come across any issues. Thanks.

brettChapman commented 3 years ago

I've tried running vg gbwt, but I keep getting a message during the run about sequences being too long:

srun -n 1 singularity exec --bind /data/pangenome_snp_calling:/data/pangenome_snp_calling /data/vg_builds/vg.sif vg gbwt -x barley_pangenome_graph.xg --index-paths -o barley_pangenome_graph.gbwt
GBWTBuilder::insert(): Sequence is too long for the buffer, skipping
GBWTBuilder::insert(): Sequence is too long for the buffer, skipping
[the same warning repeated 18 times in total]

I'm building the GBWT from the XG graph that I indexed from the multiple .vg files using vg ids -j and vg index -x.

jltsiren commented 3 years ago

The default size for GBWT construction buffer is 100 million nodes. The buffer must be large enough to fit a path and its reverse complement (and ideally many similar paths at once). If your paths are longer than 50 million nodes, they don't fit into the buffer. You can increase buffer size to N million nodes with option --buffer-size N.
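
For example, to raise the buffer to 1000 million nodes:

vg gbwt -x graph.xg --index-paths --buffer-size 1000 -o graph.gbwt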

brettChapman commented 3 years ago

Hi

I ran vg giraffe, and after a few weeks of running it was killed off by the OOM killer. The node I'm running on has 126GB of free memory, with 2TB allocated to swap space. Besides trying to get access to a machine with more RAM, is there anything else I can try to get this to complete? Could I reduce the number of cores (32 cores is the max)? Is it possible to align the reads in an iterative process, instead of aligning all reads in one job? Thanks.

vg giraffe -t 32 -x barley_pangenome_graph.xg -H barley_pangenome_graph.gbwt -f Compass/Compass.fq.gz -N Compass --interleaved --progress
Checking for availability of gg index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from barley_pangenome_graph.xg
Checking for availability of gbwt index: Loadable from barley_pangenome_graph.gbwt
Checking for availability of gbwt index: Loadable from barley_pangenome_graph.gbwt
Checking for availability of min index: Needs to be built. Checking dependencies...
Checking for availability of gg index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from barley_pangenome_graph.xg
Checking for availability of gbwt index: Loadable from barley_pangenome_graph.gbwt
Checking for availability of dist index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from barley_pangenome_graph.xg
Checking for availability of snarls index: Needs to be built. Checking dependencies...
Checking for availability of vg index: Loadable from barley_pangenome_graph.xg
Building min
Building gg
Loading vg from barley_pangenome_graph.xg
Loading gbwt from barley_pangenome_graph.gbwt
Building dist
Building snarls
srun: error: node-9: task 0: Killed
srun: Force Terminated job step 23291.0

OOM kill message:

dmesg -T| grep -E -i -B100 'killed process'
[Thu Aug  5 03:49:34 2021] [2193599]     0 2193599  1322176      953   774144     3044          -900 snapd
[Thu Aug  5 03:49:34 2021] [2208947]   112 2208947    50621      656    65536      119             0 munged
[Thu Aug  5 03:49:34 2021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=step_0,mems_allowed=0,global_oom,task_memcg=/slurm_node-9/uid_1000/job_23291/step_0,task=vg,pid=2147191,uid=1000
[Thu Aug  5 03:49:34 2021] Out of memory: Killed process 2147191 (vg) total-vm:52915396112kB, anon-rss:124475328kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:5607432kB oom_score_adj:0
adamnovak commented 3 years ago

It looks like you did indeed run out of memory, while finding snarls in the graph or making the distance index. It didn't even get to read mapping (the snarls file is an intermediate it needs to build other mapping indexes), so changing the core count or batching the reads probably would not help.

You say you let Giraffe work on this for a few weeks? Was it swapping the whole time? That's a highly atypical runtime.

How big is your xg file, compared to the 126 GB of memory you have? The snarl finding needs a couple of copies of the graph topology information in memory.

One option here is to build the snarls, and then the distance index that uses them, as a separate job. As is, Giraffe loads up a couple of other indexes it has available before trying to build these, which means there's less memory for indexing. You could try:

vg snarls -T barley_pangenome_graph.xg >barley_pangenome_graph.snarls
vg index -x barley_pangenome_graph.xg -s barley_pangenome_graph.snarls -j barley_pangenome_graph.dist

Then you could run Giraffe and pass along that file, and let it build the other indexes:

vg giraffe -t 32 -x barley_pangenome_graph.xg -H barley_pangenome_graph.gbwt -d barley_pangenome_graph.dist -f Compass/Compass.fq.gz -N Compass --interleaved --progress

However, it's possible that your graph is sufficiently tangled that the real problem is the per-snarl tables in the distance index being too big. I think @xchang1 made some fixes that allow some enormous tables to be left out, but I'm not sure those have made it into a release yet. If that were the problem, I'd expect the vg snarls command I gave to run through in reasonable time, but the vg index command would take forever and/or run out of memory.

Here's the Linux static binary from current vg master, which definitely has the latest distance index code. You could try using it instead of v1.33. vg.zip

If that doesn't help, and it is distance indexing that is taking all the time, you'd want to work with Xian to see if there are other fixes she can make to the distance index to make it able to deal with your graph, or whether the upcoming distance index v2 that she's working on is likely to solve your problem.

If instead it is the snarl finding that is taking all the time and memory, I would probably be the right person to debug that, but I would probably have to get the xg file and profile the code to see what about it was so hard to deal with.

brettChapman commented 3 years ago

Thanks @adamnovak

I'm now rerunning with the version of vg you provided, breaking the commands down into snarls and index first before running giraffe. I'll provide an update again. If it's the snarl finding that's the problem, then I should run into it pretty quickly. Thanks.

brettChapman commented 3 years ago

By the way, my XG file is 131GB and my GBWT file is 17GB. I do have a lot of swap space (2TB). Is a 131GB XG file too large if I only have 126GB RAM? I've managed to run other tools such as PGGB, which offload a lot to swap space if things get too memory-intensive.

ekg commented 3 years ago

The indexing has quadratic costs in the number of nodes inside a bubble. We should be pruning these out of the position index somehow, looking only at the snarl chains inside of them. Otherwise, giraffe indexing won't be robust.

adamnovak commented 3 years ago

The xg is basically identical to the representation that we use in RAM, so you really are going to want to have as much memory as your xg takes up, plus more for the actual in-progress indexes.

I'm not sure if PGGB does anything to structure its memory access locality so that the OS can efficiently swap pages out from under it. @ekg, do you have a trick we should be using? I know that vg doesn't do anything like that, so letting it fill memory to the point where it starts to swap could increase runtime to the point where it becomes impractical.

@xchang1 did add some machinery to prune out pathologically large and unstructured snarls from the distance index.

Another possible option for cramming things into less memory would be to use a PackedGraph instead of an xg; I think the PackedGraph can be smaller when loaded, and although the options don't suggest it, we should be able to accept one almost everywhere we take an XG. You would generate one with:

vg convert --packed-out graph.xg >graph.pg

But to do that conversion you need to be able to load both the source XG and the PackedGraph in progress into memory, and we've never tested it to see if it's feasible to do from swap.
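
If the conversion does go through, the .pg should then drop in wherever the xg was used earlier; as a sketch (untested on a graph this size):

# assumed: vg snarls accepts a PackedGraph in place of an xg
vg snarls -T graph.pg >graph.snarls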

Looking back at the thread, it looks like you do have individual contig PackedGraphs. We don't have any machinery in vg to combine those while doing all the work in PackedGraph format, but I'm sure it could be done using e.g. the libbdsg Python module.

xchang1 commented 3 years ago

I have a branch that will just skip big bubbles: https://github.com/vgteam/vg/tree/distance-big-snarls. It'll build a distance index, but it won't run with giraffe; I think the plan was to add stuff to the new distance index to deal with this, rather than trying to add a fix for the soon-to-be-outdated one. But if distance indexing is the problem, I can add a quick fix to make giraffe work with this index. It'll be slow, because it'll do a traversal of the graph, but hopefully not too slow, because I can have it give up if the distance gets too big. @brettChapman let me know if you need this and I can hack something up for you.

brettChapman commented 3 years ago

Thanks @adamnovak and @xchang1

My job is still running; it's still generating the snarls at this stage (it's been running for 20 days so far). I may kill it. I've recently re-run PGGB with the latest build and my graphs look a lot cleaner (using a minimum match length of 316bp for seqwish, roughly the most common repeat length in the reference genome), so I'll probably try running giraffe with the new graphs I've generated.

I see vg has recently been updated to 1.34.0 so I'll try the new vg build with the new graphs and let you know how it goes.

brettChapman commented 3 years ago

@adamnovak and @xchang1 I've tried producing snarls using v1.34 and memory still becomes an issue, even on a 1.4TB RAM machine. I'm now trying autoindex with a GFA file I produced of all chromosomes, in hopes that it manages memory better; however, I get a message about the sequence lengths being too long. I got around this with vg gbwt by setting the buffer size to 1000, but there doesn't seem to be an equivalent parameter for autoindex.

If using autoindex isn't the way to get around the memory issue, then I may need to go down the path of a modified vg with a fix to the distance indexing as suggested by @xchang1 previously.