vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg autoindex; child process 237114 signaled with status 9 representing signal 9 #3841

Closed peterdfields closed 1 year ago

peterdfields commented 1 year ago

1. What were you trying to do?

I am trying to create an index for a gfa generated with PGGB.

2. What did you want to happen?

I wanted the indexing step to complete successfully, as it did for the majority of the chromosomes I'm working with.

3. What actually happened?

I received the following output on the terminal (this is different from the errors I've received for memory limits, which are usually just Killed; this job had nearly 400GB of memory):

[vg autoindex] Executing command: vg autoindex -w map -p ch9 -g chr9.fa.gz.e316b50.417fcdf.7fecd6e.smooth.final.gfa --tmp-dir ./tmp -V 2
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Provided: Reference GFA
[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
Restored graph: 6641257 nodes
[IndexRegistry]: Constructing GCSA/LCP indexes.
[IndexRegistry] forked child 237114
InputGraph::InputGraph(): 300331356 kmers in 1 file(s)
InputGraph::readKeys(): 211092825 unique keys
InputGraph::readFrom(): 258082151 unique start nodes
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
GCSA::GCSA(): Step 4 (path length 128 -> 256)
error:[IndexRegistry] child process 237114 signaled with status 9 representing signal 9
Killed

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

I did not receive any such stacktrace.txt file.

5. What data and command can the vg dev team use to make the problem happen?

I am working on a graph generated by running PGGB on two different versions/genotypes of chromosome 9 in the mouse genome. The chromosome is in the neighborhood of 125Mbp.

6. What does running vg version say?

version v1.45.0 "Alpicella"

Please let me know if additional information would be helpful.

jeizenga commented 1 year ago

Are you running in an environment with limited hard disk space? Most of the GCSA indexing algorithm is disk-backed.
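
If disk space does turn out to be the limit, one possible workaround (a sketch, not something established in this thread) is to point the temporary files at a filesystem with more room via the --tmp-dir option already used in your command, for example:

# same command as above, but with a hypothetical scratch directory on a larger filesystem
vg autoindex -w map -p ch9 -g chr9.fa.gz.e316b50.417fcdf.7fecd6e.smooth.final.gfa --tmp-dir /path/to/large/scratch -V 2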

jltsiren commented 1 year ago

If this is a PGGB graph, pruning with the "restore paths" option will probably not do anything, because every node and every edge is used on a path. "Unfold paths" is the right option, but maybe vg autoindex does not choose to use it if there are only two haplotypes.

Manual index construction should work, but unfortunately I'm no longer familiar with the best practices for vg map with GFA input.

peterdfields commented 1 year ago

@jeizenga There isn't a disk space limit. I was able to process the other chromosomes without any issues. I can try moving to another disk with more space just to be sure.

@jltsiren I can give manual index construction a try, though as you say I don't see any documentation for how to do this on a GFA. vg autoindex does generate a .xg file, but I'm not sure whether it is complete since the autoindex step didn't finish successfully. Do you know of a way I might check this?

jeizenga commented 1 year ago

@peterdfields The XG-indexing step completed successfully, so the .xg file should be complete.
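
As a quick sanity check (a hedged suggestion, not a guarantee), you can compare the node and edge counts of the XG against the input graph, e.g.:

# print node/edge counts for the autoindex-produced XG (filename assumed from your -p ch9 prefix)
vg stats -z ch9.xg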

@jltsiren Is the issue that PGGB uses P lines instead of W lines to express haplotypes? We could maybe add a heuristic to be more robust to this, like only restoring the longest path in each connected component.

jltsiren commented 1 year ago

@jeizenga P-lines are probably the issue. Right now, PhaseUnfolder::restore_paths() restores all paths accessible with for_each_path_handle(). We could use only reference/generic paths, but I'm not sure VG can identify path senses from PGGB graphs automatically.

@peterdfields If you have a single-chromosome VG/XG graph, you are in the small graph / complex graph / with many paths scenario described in the wiki. The following commands should be enough:

vg prune -u graph.vg -m node_mapping > graph.pruned.vg     # unfold paths; write the node mapping needed for GCSA
vg index -g graph.gcsa -f node_mapping graph.pruned.vg     # build the GCSA/LCP index using that mapping

Add option -p to both commands if you want to see some progress information. You can use the XG graph instead of the VG graph in vg prune. graph.pruned.vg and node_mapping are temporary files that can be deleted once you have the GCSA index.
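
For example, using the XG graph and the progress option (filenames are placeholders):

vg prune -p -u graph.xg -m node_mapping > graph.pruned.vg
vg index -p -g graph.gcsa -f node_mapping graph.pruned.vg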

The thing I'm not sure about is the translation from VG node ids to GFA segment ids. GCSA works with up to 1024 bp nodes, while a GFA file may have segments of arbitrary length. vg autoindex already chopped long segments for you, but I'm not sure if the information needed for reversing that is stored anywhere.

peterdfields commented 1 year ago

@jltsiren Thank you for your suggestions. Regarding the commands you suggest: to get a .vg from either a .xg or a .gfa, it looks like I need to use vg convert? And from there it looks like I also need to run vg mod -X 256? With these two intermediate steps (sketched below) I can get the indexing to complete successfully for the chromosome that produced the error. However, I also tried the same process on a chromosome that had completed successfully with vg autoindex. The resulting index files (one with vg prune, one without) are not the same size as the one created with vg autoindex, but perhaps that's not so important?
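
For reference, the full sequence of steps I'm describing looks roughly like this (a sketch; filenames are placeholders):

vg convert -g chr9.gfa -v > graph.vg               # GFA (or XG) -> VG
vg mod -X 256 graph.vg > graph.chopped.vg          # chop long nodes before GCSA indexing
vg prune -u graph.chopped.vg -m node_mapping > graph.pruned.vg
vg index -g graph.gcsa -f node_mapping graph.pruned.vg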

jeizenga commented 1 year ago

Yes, you can use vg convert to get a .vg file. It's not necessary to use vg mod -X 256, though; vg autoindex effectively does that for you. The pruned graphs would be expected to be different sizes, as the manual pipeline that @jltsiren described is slightly different from what's implemented in vg autoindex.

peterdfields commented 1 year ago

@jeizenga Are there any obvious/substantive differences one should expect between indexes built with the manual pipeline vs. vg autoindex? Should I go ahead and use the manual pipeline on all the other chromosomes that have already been indexed with vg autoindex, to avoid downstream heterogeneity?

jeizenga commented 1 year ago

The difference would be in the GCSA2 index's ability to detect all relevant matches, but I'd expect the difference to be pretty marginal. I don't think we have any method to combine GCSA2 indexes across separate graphs anyway though. How were you planning to combine your chromosome indexes?

peterdfields commented 1 year ago

@jeizenga Okay, I'll try some small experiments as I go to see if anything obvious pops up. I hadn't planned on combining indexes; I was assuming I would have to conduct analyses one chromosome at a time. I realize this isn't ideal for the mapping stage, but it seemed like the computational resources for indexing the whole genome (19 chromosomes, ~2.5Mbp in total length) would be extreme. Maybe you have a recommendation for accomplishing this? Maybe doing the manual indexing described above would decrease the total resource requirement substantially? Regarding the eventual application of the graph, I was hoping to apply graph_peak_caller to a relatively small ATAC-seq dataset. If there are any other vg-related tools you could recommend for ChIP/ATAC-seq, that would be great, as graph_peak_caller development seems to have stopped and I'm concerned that advances in vg and related tools are going to be a problem for this approach.

jeizenga commented 1 year ago

Yeah, I would advise strongly against having separate indexes during mapping. That would make it very hard for the internal algorithms to decide how much effort to put into a read when it doesn't immediately find a good mapping, and it would really mess up the uncertainty estimation. The end effect will be very slow mapping and inaccurate mapping qualities. Plus, the task of merging the read alignments from separate mapping runs is itself nontrivial.

At 2.5 Mbp per chromosome, that's still only 50 Mbp in total. We regularly index and map to genomes in the several Gbp range. The VG toolchain generally expects that you have an amount of memory available that might be challenging on a laptop (10s of GB), but most compute servers should be able to handle it.

I am not aware of any other ChIP/ATAC-seq tools that use pangenome graphs though.

peterdfields commented 1 year ago

@jeizenga Sorry, I meant ~2.5Gbp (mouse). Are there particular strategies/flags you've used for multi-Gbp genomes? At least with vg autoindex, I sometimes needed on the order of 100s of GB of RAM for some chromosomes to get the indexes to complete successfully. I figured there was no way I could get this to scale to the whole genome given how much memory some of the individual chromosomes needed.

jeizenga commented 1 year ago

For simpler graphs produced from VCF data or e.g. by Minigraph or Minigraph-Cactus, the default parameters in vg autoindex usually work fine. PGGB produces graphs with a more complex topology that we're still trying to find reliable ways to index. One option is to try to heavily prune the graph with vg prune before making the GCSA2 index. My suspicion is that once you find parameters that work for any chromosome individually, they will probably work fairly well for all chromosomes jointly. If you want to process the chromosomes differently, you could also join them before making the GCSA2 index using vg combine.
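
For example, a rough sketch of the joint route (assuming per-chromosome .vg graphs whose node id spaces do not overlap; filenames are placeholders):

vg combine chr*.vg > whole_genome.vg                                    # concatenate the chromosome graphs into one file
vg prune -u whole_genome.vg -m node_mapping > whole_genome.pruned.vg    # unfold paths on the combined graph
vg index -g whole_genome.gcsa -f node_mapping whole_genome.pruned.vg    # build a single GCSA/LCP index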

peterdfields commented 1 year ago

@jeizenga Thank you for your assistance in troubleshooting this problem. For now I have reverted to using variant calls and vg to create the graph. For this particular project I don't think that PGGB is necessary, and given that we're trying to use the graph_peak_caller pipeline, maybe not even possible. Everything was working well until I got to the vg view step to convert the graph from .vg to .json. I'll close this issue for now. Thank you @jeizenga and @jltsiren again.