vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg autoindex needs a --buffer-size parameter for vg gbwt #3455

Open brettChapman opened 2 years ago

brettChapman commented 2 years ago

Hi

While running vg autoindex I get complaints about the sequence length being too long. I had the same problem when running vg gbwt, so I set the buffer size to 1000.
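For reference, a rough sketch of the manual workaround (file names are placeholders and the -x/-E/-o flags are illustrative of one way to build the GBWT; the exact input options depend on how the GBWT is built, so check vg gbwt --help):

# -x/-E/-o are illustrative; --buffer-size is the option discussed in this issue
vg gbwt -x pangenome.xg -E --buffer-size 1000 -o pangenome.gbwt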

Could the same parameter be set with vg autoindex?

I'm running vg version 1.35.

Thanks.

xchang1 commented 2 years ago

Sure no worries, it can wait until after the holidays. Have a good break!

brettChapman commented 2 years ago

Thanks, you too!

brettChapman commented 2 years ago

How is the indexing coming along with the GBZ file I provided?

I tried generating the GBZ file again, and I realise now it's generated from the full graph, not the pruned graph. Using odgi prune does not leave paths in the resulting graph, and the same goes for vg prune, so pruning my graph to reduce complexity and memory overhead isn't an option, as the indexing just fails on the pruned graph.

xchang1 commented 2 years ago

Hey, sorry I've been out, I'm just getting to this now. I've made some changes, but so far I haven't been able to build a distance index either.

vg prune isn't really meant for the giraffe indexes. You could try using vg clip to take out big deletion edges; this might make the snarls smaller and more manageable. Which graph construction method did you use for this graph? The minigraph-cactus pipeline is better suited for giraffe than the pggb one.

brettChapman commented 2 years ago

That's ok.

I'll look into vg clip. I came across vg simplify; would that also help to remove large snarls?

I used PGGB. Over a year ago I tried Cactus, but I ran out of memory while using it. I also looked into minigraph, but saw the memory requirements escalate quickly. PGGB has been the only tool I've managed to get running without it being killed off by limited memory. I saw minigraph-cactus is a new iteration of the Cactus pipeline, so maybe this time around it would work. The problem is that my project needs wrapping up now and I can't afford to go back and start from scratch. I'm currently getting a paper out based on the PGGB results, and will later follow up with a paper using Giraffe, given other varieties to query the pangenome graph with, and also delve into the pan-transcriptome using vg rna for another paper. I'd prefer to keep to the same pangenome graph for these later papers so there's continuity in the story and the pipelines used. I'm working alongside other groups, so there's a bit of pressure mounting to get the project to its conclusion.

xchang1 commented 2 years ago

Ah got it. I'll keep trying with the pggb graph then. Do you have the snarl index for that one? Building it myself is pretty slow

vg simplify probably won't help with the big snarls

xchang1 commented 2 years ago

If that's too much trouble, can you use this docker container (xhchang/vg:get_snarl_nodes), run this command: vg nodes -g [graph.xg] -s [graph.snarls] > nodes.txt, and send me nodes.txt? That should write all the nodes in the biggest snarl to a file so I can get a subgraph that will hopefully be easier to work with.

brettChapman commented 2 years ago

Is the snarl index generated using vg snarls?

vg snarls -t 2 -T pangenome.xg > pangenome.snarls

Because I have that already.

xchang1 commented 2 years ago

Yep, that one

brettChapman commented 2 years ago

I've sent you the pangenome.snarls index and also the nodes.txt files, linked from a google drive.

xchang1 commented 2 years ago

Last thing (probably): Can you send me the xg too please?

brettChapman commented 2 years ago

Sure, it is 136GB though. I'll try gzipping it and see how I go. If I can't compress it enough, we'll have to look at other ways of getting it to you, because I may not have enough Google Drive or free HDD space for it.
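For what it's worth, a minimal sketch of compressing and splitting the file into uploadable chunks with standard tools (file names and the 10G chunk size are placeholders):

# compress, then split into 10G pieces for upload
tar -czf pangenome.xg.tar.gz pangenome.xg
split -b 10G pangenome.xg.tar.gz pangenome.xg.tar.gz.part-
# on the receiving end: cat pangenome.xg.tar.gz.part-* > pangenome.xg.tar.gz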

brettChapman commented 2 years ago

I've managed to tar gzip the file and have emailed you the link to Google Drive.

ekg commented 2 years ago

Could you clip out the high depth regions of your pangenome graph (e.g. centromeres)? Or is the problem more about having giant snarls / tangles?

Surely these graphs from pggb can be improved in many ways, and that's ongoing work. But there is a limit to how simple the graphs will get, because we can't make biology simpler. Is there a plan to adapt giraffe to work on any kind of graph? I know it was able to run on a yeast cactus alignment, but that's orders of magnitude simpler than the graphs we get for large (plant) genomes. A different kind of distance index may be a requirement.

One trick could be to partition the graph into non-looping segments and build the index across these.


xchang1 commented 2 years ago

Hey @brettChapman, I just pushed a change that should make distance indexing use less memory. You can find a docker container for it on dockerhub: xhchang/vg:for-brett. Mapping will likely be very slow, but it should work.

brettChapman commented 2 years ago

Thanks @xchang1, I'll give it a try and let you know how it goes. Do you have an idea of just how much slower it would be?

xchang1 commented 2 years ago

It's hard to say; it depends on how many reads end up in the big snarls and how complicated the snarls are. The last time I tried it, it was about 10 times slower, but I don't think my graph was as gnarly as yours.

brettChapman commented 2 years ago

Ok thanks @xchang1

Is vg index -s still a node size limit, or do I specify the snarls index? If it's a node size limit, would -s 1000 be sufficient based on your testing?

xchang1 commented 2 years ago

Yeah, that will still work. I used -s 3000 (my biggest snarl was 2500) and it was fine

brettChapman commented 2 years ago

Hi @xchang1

Just providing an update on my distance indexing. I've tried multiple different snarl size limits. Based on vg stats, my largest snarl is 611408. I figured I should test with -s 10000 or some other large cut-off; however, I still ended up crashing vg with memory errors. I then reduced the cut-off heavily to -s 1000 and found I don't get any crashes (yet), but the vg index job has now been running for a few weeks. I suspect that lowering the threshold further would make the job take even longer, while increasing it would just lead to a crash.

Have there been any further developments in the area of distance indexing I could try?

@ekg I remember you mentioned somewhere that setting a higher segment length with PGGB helps to avoid very large snarls? I set my PGGB segment length to 1,000,000bp. I'd rather avoid rerunning from the start, but I may have no choice if I can't get distance indexing to work. Perhaps using the latest PGGB version may help as well, along with a segment length of 2Mbp instead of 1Mbp?

xchang1 commented 2 years ago

You can try it with this new docker container xhchang/vg:distance-limit

I added an option to vg index called --distance-limit. This version will stop a traversal of a snarl when the distance is greater than the distance limit, which means that it will also not be accurate for distances larger than the distance limit. This shouldn't be a problem since giraffe doesn't need to be accurate over large distances.

If you're using paired-end reads, then you want the distance limit to be a little larger than the distance between a pair of reads. For the reads I was using, the average fragment distance was about 500, so I'd probably set it to 1000 for my reads. I set the default to 5000, but I haven't tested that beyond making sure it runs with my graphs.
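Putting the pieces from this thread together, a distance index build on that container might look roughly like this (a sketch only: -s and --distance-limit are the options discussed above, while -j as the distance index output flag and the file names are assumptions, so check vg index --help inside the container):

# -s 1000: snarl size limit (value that worked earlier in this thread)
# --distance-limit 1000: roughly 2x the average fragment distance for paired-end reads
# -j: distance index output (assumed flag name)
vg index -t 16 -s 1000 --distance-limit 1000 -j pangenome.dist pangenome.xg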

brettChapman commented 2 years ago

Thanks @xchang1, I'll try the new build and see how it goes. Does it still have the -s parameter, and does it take a snarl file or a single snarl size?

xchang1 commented 2 years ago

Yeah, it's the same -s with the snarl size

brettChapman commented 2 years ago

Ok, thanks. So if I were to go back and use larger -s values in the range of my largest snarl, in the hundreds of thousands, I could do something like vg index -s 100000 --distance-limit 5000. Or would you suggest I use a smaller -s value, even though my largest snarl is 611408?

brettChapman commented 2 years ago

I believe my read fragment insert size is about 500bp, from a HiSeq 3000 library.

xchang1 commented 2 years ago

I'd still recommend using a smaller -s value. The -s value also determines the size of the final index, so a larger one could make the index too big, even if the memory needed to construct it is reasonable.

brettChapman commented 2 years ago

Hi @xchang1 and @ekg

I've tried to index the genome graph using multiple different values, low and high, for -s and --distance-limit, but I continue to get segmentation faults with requests for >2TB RAM. I've now resorted to going back and building the genome graph again, in the hope that different parameters and the latest build of PGGB might resolve the issue.

Do you have any suggestions for PGGB parameters that would reduce the snarl size of the resulting graph?

My initial graph used PGGB v0.1.3 and the latest graph I'm producing is using v0.3.1.

The first graph I produced used a segment length of 1,000,000bp, 95% identity, the default POA target length, a 0bp minimum block size, a minimum match length of 316bp, and poa_params="1,4,6,2,26,1". The second graph I'm producing uses a segment length of 100,000bp, 95% identity, a POA target length of "13117,13219", a minimum match length of 316bp, poa_params="1,4,6,2,26,1", and the default minimum block length (5 * segment length).
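For context, those settings would map onto a pggb command along these lines (a hypothetical sketch: the flag letters are my reading of how pggb exposes these parameters, and the input file, haplotype count and output directory are placeholders, so check pggb --help for your version):

# hypothetical pggb invocation for the second graph's parameters:
#   -s segment length, -p percent identity, -k min match length,
#   -G POA target lengths, -P POA parameters, -n number of haplotypes (placeholder)
pggb -i barley_pangenome_chr1H.fa -s 100000 -p 95 -k 316 \
     -G 13117,13219 -P 1,4,6,2,26,1 -n 20 -t 16 -o chr1H_out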

So far chromosome 1H has completed, and I've run vg stats on both graphs. The new graph appears larger and its topology looks a lot busier.

vg stats results:

Old graph total number of snarls: 27,520,684
New graph total number of snarls: 43,673,533

Old graph largest snarl size: 74,735,218
New graph largest snarl size: 157,202,314

It looks as though reducing the segment length, and perhaps the other parameter changes relative to the last graph I produced, has increased the total number of snarls and the largest snarl size considerably.

Would you recommend I try a much higher segment length, maybe twice what I had before (2,000,000bp), perhaps lower the percent identity threshold (chromosomes like chr7H are much more divergent), or consider different POA parameters? If I could get an idea of which parameters to focus on to reduce the overall complexity of the graph and the snarl size, that would be great! Thanks.

brettChapman commented 2 years ago

Chromosome 4H has completed. This time I'm seeing more snarls, but the largest snarl is smaller than the largest in the old graph, though not by a huge margin.

Chromosome 4H
Old graph total number of snarls: 30,294,864
New graph total number of snarls: 48,135,988

Old graph largest snarl size: 112,125,847
New graph largest snarl size: 93,791,119

Old 4H graph: [image: barley_pangenome_4H fasta 5afc036 7715ffd 41ba588 smooth og viz_inv]

New 4H graph: [image: barley_pangenome_4H fasta a28e8c8 371d99c 1dab94c smooth final og viz_inv_multiqc]

xchang1 commented 2 years ago

Hi Brett,

Sorry you're having so much trouble with this. I'm out of ideas for quick ways to reduce the memory use. Maybe Erik can help simplify the pggb graph but I would still suggest building the graph using the minigraph cactus pipeline ( https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md). It prunes complex regions so it won't represent the full sequence space the way the pggb graph will, but it will be much smaller and easier to work with.


brettChapman commented 2 years ago

Thanks @xchang1, I'll look into minigraph-cactus.

I think there may be another way too by using ODGI to remove complex regions. I came across this tutorial online for removing complex centromere regions: https://odgi.readthedocs.io/en/latest/rst/tutorials/remove_artifacts_and_complex_regions.html

Also this discussion: https://bytemeta.vip/repo/vgteam/odgi/issues/381

@ekg would the resulting graph after using odgi depth and odgi extract to remove complex regions be compatible with vg index and giraffe read mapping?

brettChapman commented 2 years ago

Hi @xchang1 @ekg

I'm giving pruning the graph a try first, which seems to have worked at removing complex regions. I may have to tweak the prune parameters a bit to remove enough of the complex regions, but I think it should work in theory, as it retains everything from a single chosen reference and removes highly complex regions from all the other genomes. While indexing with the distance-limit fork of vg, I found some errors pop up during the run, as if there were unexpected lines in the graph, e.g. 'pruned' no file found, and some number like this: 1567800 no file found. Perhaps pruning has added lines to the graph that vg isn't expecting?

brettChapman commented 2 years ago

Hi @xchang1, @ekg and @AndreaGuarracino

I've managed to produce a pangenome graph using cactus-minigraph based on @xchang1's suggestion, and I generated a VG graph successfully. I've tried visualising the graph using ODGI, but I've been getting segmentation faults, even when running on a node with 1.4TB RAM.

I ran the following:

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} vg convert ${VG_INPUT} -f > ${GFA_OUTPUT}

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} gfaffix ${GFA_OUTPUT} -o ${GFA_FIX_OUTPUT}

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi build -t 20 -P -g ${GFA_FIX_OUTPUT} -o ${ODGI_OUTPUT} -O

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi sort -P -p Ygs -t 20 -i ${ODGI_OUTPUT} -o ${ODGI_SORTED} 

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi viz -i ${ODGI_SORTED} -o ${ODGI_SORTED}.viz_multiqc.png -x 1500 -y 500 -a 10

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi viz -i ${ODGI_SORTED} -o ${ODGI_SORTED}.viz_pos_multiqc.png -x 1500 -y 500 -a 10 -u -d

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi viz -i ${ODGI_SORTED} -o ${ODGI_SORTED}.viz_depth_multiqc.png -x 1500 -y 500 -a 10 -m

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi viz -i ${ODGI_SORTED} -o ${ODGI_SORTED}.viz_inv_multiqc.png -x 1500 -y 500 -a 10 -z

Some images are generated but are not viewable, likely corrupted.

Is there another way I could visualize the cactus-minigraph graphs, or different parameters I could use with ODGI?

Or is this just not possible due to the very different graphs each approach generates?

I'd like to visually compare the PGGB graph and cactus-minigraph graphs, and hopefully use one of the graphs to align reads using Giraffe.

I'm also trying larger segment lengths with PGGB to reduce the number of large snarls in the graph, in the hope of being able to index the graph successfully. I upped the segment length all the way to 100Mbp, which heavily reduced the complexity and removed those large loops in the graph topology which often span the entire chromosome, but I noticed that variations (such as inversions) which were once visible disappeared, and the graph appeared a lot more fragmented. I take it segment lengths ideally should be around the size of the largest SV. I've now reduced to 20Mbp to see if I can find a good compromise. I'm testing with chr1H only at the moment until I find an optimal segment length.

Thanks.

glennhickey commented 2 years ago

Hi @brettChapman

With PGGB, I think most processing, including visualization, is done at chromosome scale. cactus-graphmap-join (which fwiw runs gfaffix for you if you pass --gfaffix) will produce the chromosome graphs in a subfolder of the output. If you haven't already, you may want to try visualizing them individually rather than passing in the whole graph.

One other thing to consider is the clipping. When working with larger genomes, I recommend sticking with the procedure used for the HPRC graphs detailed here.

Which leaves you with 3 (!) graphs, where each is a subset of the previous: the full graph, the clipped graph (unaligned sequence such as centromeres removed), and the allele-frequency-filtered graph.

The full graph may have many large insertions due to unaligned centromeres that give odgi viz trouble, so you may want to try the one without centromeres if it doesn't work. But the allele filtered graph, despite being simpler, may be more difficult to view due to the more fragmented paths so you might want to avoid that one.

Hope this helps.

brettChapman commented 2 years ago

Hi @glennhickey, thanks for your feedback, I'll look into your suggestion. The HPRC notes look new, as I was following the mammal and yeast tutorial a few weeks ago. Regarding my graphs, I've generated a graph for each chromosome based on the tutorial here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md

I plan to join the graphs later when attempting to index and use with Giraffe. At the moment I was attempting to visualise each of the chromosome graphs.

My current procedure for generating a graph for each chromosome is to run these commands:

srun -n 1 singularity exec --cleanenv \
            --no-home \
            --overlay ${JOBSTORE_IMAGE} \
            --bind ${CACTUS_SCRATCH}/tmp:/tmp \
            ${CACTUS_IMAGE} cactus-minigraph /cactus/jobStore ${GUIDE_TREE} ${GFA_FILE} --realTimeLogging --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --reference ${REFERENCE} --maxDisk 1000G --maxCores 2 --maxMemory 126G --defaultMemory 126G

srun -n 1 singularity exec --cleanenv \
            --no-home \
            --overlay ${JOBSTORE_IMAGE} \
            --bind ${CACTUS_SCRATCH}/tmp:/tmp \
            ${CACTUS_IMAGE} cactus-graphmap /cactus/jobStore ${GUIDE_TREE} ${GFA_FILE} ${PAF_FILE} --outputFasta ${FASTA_FILE} --realTimeLogging --workDir=/cactus/workDir --mapCores 16 --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --reference ${REFERENCE} --maxCores 2 --maxMemory 126G --defaultMemory 126G

srun -n 1 singularity exec --cleanenv \
                        --no-home \
                        --overlay ${JOBSTORE_IMAGE} \
                        --bind ${CACTUS_SCRATCH}/tmp:/tmp \
                        ${CACTUS_IMAGE} cactus-align /cactus/jobStore ${GUIDE_TREE} ${PAF_FILE} ${HAL_FILE} --pangenome --pafInput --outVG --reference ${REFERENCE} --realTimeLogging --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --maxCores 2 --maxMemory 126G --defaultMemory 126G

I get a HAL file and a VG file for each chromosome. I then followed the rest of the tutorial using cactus-graphmap-join to join the graph ready for indexing.

Based on your suggestion, it sounds like I should move on to joining the graphs, which will then give me multiple sub-graphs for each chromosome, including the full graph, is that right? The graphs produced by cactus-graphmap-join should then give me something I can visualize?

glennhickey commented 2 years ago

I see, so you're manually running the chromosomes rather than using cactus-graphmap-split and cactus-align-batch? That's fine, but you probably still want to look at the chromosome graphs as output by cactus-graphmap-join --reference REFERENCE --wlineSep "." --clipLength 10000 --clipNonMinigraph --gfaffix (as described in that link).

brettChapman commented 2 years ago

Yeah, I'm running each chromosome manually instead of starting with whole genomes. I previously ran whole genomes through Progressive Cactus about 2 years ago and ran into memory limits, so I'm running each chromosome separately on a node of my cluster, which only has 128GB RAM per node. I have a larger machine now with 1.4TB RAM, but it's a shared resource and I don't want to tie it up indefinitely, so I'll run on the larger machine when necessary, which I may have to when running cactus-graphmap-join.

Thanks, I'll definitely run cactus-graphmap-join now with those parameters before attempting to visualize the graphs again. I'll update again once I run it. Some other chromosomes are still running.

brettChapman commented 1 year ago

Hi @glennhickey

I've just completed a run using these parameters:

cactus-graphmap-join /cactus/jobStore --indexCores 30 --vg ${original_folder}/*H/*.vg --outDir ${original_folder}/barley-pg --outName barley-pg --reference Morex_v3 --wlineSep "." --gfaffix --disableCaching --workDir=/cactus/workDir

I'm now running again to get filtered graphs and indexes for giraffe:

cactus-graphmap-join /cactus/jobStore --indexCores 30 --vg ${original_folder}/*H/*.vg --outDir ${original_folder}/barley-pg --outName barley-pg --reference Morex_v3 --wlineSep "." --clipLength 10000 --clipNonMinigraph --vcf --giraffe --gfaffix --disableCaching --workDir=/cactus/workDir

The first run, which I did to get the full graph, gave me these files:

drwxrwxr-x 2 ubuntu ubuntu 4.0K Dec 17 09:43 clip-barley-pg
-rw-rw-r-- 1 ubuntu ubuntu  17G Dec 17 10:00 barley-pg.gfa.gz
-rw-rw-r-- 1 ubuntu ubuntu 9.0G Dec 17 10:09 barley-pg.gbwt
-rw-rw-r-- 1 ubuntu ubuntu  33G Dec 17 10:40 barley-pg.gg
-rw-rw-r-- 1 ubuntu ubuntu 3.9G Dec 17 10:44 barley-pg.trans.gz
-rw-rw-r-- 1 ubuntu ubuntu  28G Dec 17 11:11 barley-pg.xg
-rw-rw-r-- 1 ubuntu ubuntu 881M Dec 17 11:12 barley-pg.snarls

What is the .gg file for, is that the distance index? And what is the .trans.gz file for?

In the clip-barley-pg/ folder I have:

-rw-rw-r-- 1 ubuntu ubuntu 8.1G Dec 17 08:33 barley_pangenome_chr1H.vg
-rw-rw-r-- 1 ubuntu ubuntu  13G Dec 17 08:46 barley_pangenome_chr2H.vg
-rw-rw-r-- 1 ubuntu ubuntu  11G Dec 17 08:57 barley_pangenome_chr3H.vg
-rw-rw-r-- 1 ubuntu ubuntu 9.3G Dec 17 09:06 barley_pangenome_chr4H.vg
-rw-rw-r-- 1 ubuntu ubuntu  11G Dec 17 09:18 barley_pangenome_chr5H.vg
-rw-rw-r-- 1 ubuntu ubuntu  11G Dec 17 09:29 barley_pangenome_chr6H.vg
-rw-rw-r-- 1 ubuntu ubuntu  13G Dec 17 09:43 barley_pangenome_chr7H.vg
-rw-rw-r-- 1 ubuntu ubuntu 115M Dec 17 09:43 clip-stats.bed
-rw-rw-r-- 1 ubuntu ubuntu 110M Dec 17 09:43 path-stats.tsv
-rw-rw-r-- 1 ubuntu ubuntu  821 Dec 17 09:43 graph-stats.tsv
-rw-rw-r-- 1 ubuntu ubuntu 110M Dec 17 09:43 sample-stats.tsv

How can I make use of these files, or are they intermediate files from the run? I noticed the .vg files are not exactly the same as the .vg files I input. Are these clipped graphs, with the accompanying clipped-graph statistics? Should these graphs now be usable with ODGI viz?

Thanks.