pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

Memory size required to build large genomes #385

Open Lucio-Yang opened 3 months ago

Lucio-Yang commented 3 months ago

Hi!

I got this error (Command terminated by signal 9) when running PGGB, which may be caused by insufficient memory.

How much memory would PGGB need to build a pangenome graph from 10 genomes of about 10 Gb each? Thanks!

ekg commented 3 months ago

If you want to do it all together in one step for genomes this large, it might take several TB.

pggb isn't designed to handle a single input of this size.

You can partition the problem by chromosome or by community. CC @AndreaGuarracino for docs

We are building a new method to subdivide the graph building problem into very small pieces that can be run in parallel. See https://github.com/ekg/impg
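A minimal sketch of the by-chromosome split, assuming the sequence names follow a PanSN-style `sample#haplotype#contig` convention and that same-named contigs are homologous across assemblies. Both are assumptions for illustration, and `read_fasta` and `partition_by_contig` are hypothetical helpers, not part of pggb. Each resulting group would then go to its own pggb run.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any free-text description.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def partition_by_contig(records: Iterable[Record], delim: str = "#") -> Dict[str, List[Record]]:
    """Group records by the last delimiter-separated field of their name.

    ASSUMPTION: names look like sample#haplotype#contig, so the last field
    is the chromosome label shared across assemblies.
    """
    groups: Dict[str, List[Record]] = defaultdict(list)
    for name, seq in records:
        contig = name.rsplit(delim, 1)[-1]
        groups[contig].append((name, seq))
    return dict(groups)


# Tiny in-memory example; real inputs would be streamed to per-chromosome files.
demo = """>sampleA#1#chr1
ACGT
>sampleB#1#chr1
ACGA
>sampleA#1#chr2
TTGG
""".splitlines()

groups = partition_by_contig(read_fasta(demo))
for contig in sorted(groups):
    print(contig, [name for name, _ in groups[contig]])
```

For genomes of this size the sequences should not be held in memory as done here; the sketch only shows the grouping logic. Partitioning by community, the other option mentioned above, does not rely on contig names.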

Lucio-Yang commented 3 months ago

Thanks for your quick reply!

We have found translocations of both large and small segments between the chromosomes of these genomes. If the graph is built separately for each chromosome, will that information be lost? It is also not always clear which chromosomes are involved in a translocation.

AndreaGuarracino commented 3 months ago

Lucio, which pggb step is running out of memory?

Lucio-Yang commented 3 months ago

I think it’s the "sorting path_mappings" step.

subwaystation commented 3 months ago

Which version are you using?

Lucio-Yang commented 3 months ago

pggb 0.5.4

AndreaGuarracino commented 3 months ago

Can you update to pggb v0.6.0? On Aug 28, 2023, I merged a substantial smoothxg refactoring (https://github.com/pangenome/smoothxg/pull/197) that reduces memory consumption and parallelizes the path embedding. That version might (I'm not sure) get past the point where you are currently running out of memory.

Lucio-Yang commented 3 months ago

I can update to the latest version and try again. Can PGGB be run in parallel on multiple nodes?

AndreaGuarracino commented 3 months ago

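You can parallelize the alignment step on multiple nodes in two ways:

* manual way: follow the tips explained at https://github.com/waveygang/wfmash?tab=readme-ov-file#running-wfmash-on-a-cluster. Run the mapping step on one node, align on multiple nodes, merge the results, and then run PGGB on one node, providing both the FASTA and the PAF as input.

* automated way: use the Nextflow flavour of PGGB, https://github.com/nf-core/pangenome, although it is not 100% identical to PGGB or fully up to date at the moment.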

AndreaGuarracino commented 3 months ago

Or you could try to repeat only the normalization step by updating PGGB and rerunning the same command with --resume added (pggb .... --resume).

subwaystation commented 3 months ago

You can parallelize the alignment step on multiple nodes in two ways:

* manual way: follow the tips explained at https://github.com/waveygang/wfmash?tab=readme-ov-file#running-wfmash-on-a-cluster. Run the mapping step on one node, align on multiple nodes, merge the results, and then run PGGB on one node, providing both the FASTA and the PAF as input.

* automated way: use the Nextflow flavour of PGGB, [https://github.com/nf-core/pangenome](https://github.com/nf-core/pangenome/tree/1.1.2), although it is not 100% identical to PGGB or fully up to date at the moment.

nf-core/pangenome currently has wfmash pinned at v0.10.4. This should give a more stable experience, since the parameter space of the latest wfmash release has not been explored thoroughly. Other differences should be cosmetic. If not, please ping me.
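For the "merge the results" part of the manual route, a minimal sketch: PAF is a headerless, line-oriented format, so merging per-node alignment outputs amounts to concatenating them. The file names and the `merge_paf` helper are hypothetical; how the chunks are produced is covered by the wfmash README linked above.

```python
from pathlib import Path
from typing import Iterable


def merge_paf(chunk_paths: Iterable[Path], merged_path: Path) -> int:
    """Concatenate PAF chunks into one file and return the record count.

    Blank lines are skipped and a missing final newline in a chunk is
    repaired so that records from consecutive chunks never get fused.
    """
    count = 0
    with open(merged_path, "w") as out:
        for path in chunk_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
    return count
```

Usage would look like `merge_paf(sorted(Path("aln").glob("chunk_*.paf")), Path("all.paf"))`, where the `aln/chunk_*.paf` layout is made up for the example. The merged PAF is then what gets passed to PGGB together with the FASTA.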

Lucio-Yang commented 3 months ago

Hi! I reran with PGGB 0.6.0 and it has been running for 12 days without any errors. If I run PGGB on each chromosome separately, how do I merge the resulting GFA files? Thanks!

AndreaGuarracino commented 3 months ago

odgi squeeze!
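A rough sketch of what that could look like, written as a small Python helper that only prepares the commands and does not run them. The `-g`/`-o` flags for `odgi build` and `-f`/`-o` for `odgi squeeze` are given as I understand them, so please check `odgi build --help` and `odgi squeeze --help` for your version; the file names and the `squeeze_commands` helper are hypothetical.

```python
from pathlib import Path
from typing import List, Sequence


def squeeze_commands(gfa_paths: Sequence[Path], list_file: Path, out_graph: Path) -> List[List[str]]:
    """Prepare odgi commands to combine per-chromosome graphs into one file.

    Writes list_file (one ODGI graph path per line) and returns the commands
    as argument lists. Nothing is executed here.
    """
    commands: List[List[str]] = []
    og_paths: List[Path] = []
    for gfa in gfa_paths:
        og = gfa.with_suffix(".og")
        # ASSUMPTION: convert each GFA to ODGI format before squeezing.
        commands.append(["odgi", "build", "-g", str(gfa), "-o", str(og)])
        og_paths.append(og)
    list_file.write_text("".join(f"{og}\n" for og in og_paths))
    # ASSUMPTION: squeeze reads the list of graphs from -f and writes to -o.
    commands.append(["odgi", "squeeze", "-f", str(list_file), "-o", str(out_graph)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Note that squeezing places the per-chromosome graphs side by side in one file; it does not add alignments between chromosomes.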

ekg commented 3 months ago

It's not ready yet, but we have a partitioning approach based on the implicit graph (impg) that should mostly solve these kinds of issues. It will take some learning to figure out the best configuration though.