pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

How to construct a pangenome graph from two assembled genomes of two species #147

Open Aannaw opened 2 years ago

Aannaw commented 2 years ago

Hello Professor, I have two assembled genomes from two species, and I want to construct a graph from them using PGGB. I first combine the two genomes into input.fasta, then index it, and finally run PGGB. Following the human example in the README, my commands are:

cat A.fasta B.fasta > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -s 100000 -p 70 -t 40 -v -L -U -o out -T 20 -n 2 -H 2 -G 20000

I used "mash dist" to calculate the divergence between A.fasta and B.fasta. The result is:

A.fasta B.fasta 0.0181313 0 519/1000

I do not know how to convert this result to the approximate percent identity to provide as -p, or how to adapt the parameters "-k -s -G". I set -n to 2 to match the number of assembled genomes. I cannot run by chromosome because I cannot identify the corresponding chromosomes between the two genomes. Also, can you tell me the expected memory and CPU usage? Each of my two genomes is about 3G in size. I would appreciate any suggestions. Looking forward to your reply.

ekg commented 2 years ago

Thank you for reaching out with your questions. Here are some suggestions.

The mash distance estimate would be matched by setting -p 90 or -p 95.
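
As a rough sketch of the conversion asked about above: one common approximation (an assumption here, not something stated in this thread) is identity ≈ 100 × (1 − distance). The function name below is hypothetical and is not part of mash or pggb. For the reported distance of 0.0181313 the estimate comes out a little above 98%, so the suggested -p 90 or -p 95 sits below it.

```python
def mash_distance_to_identity(distance: float) -> float:
    """Approximate percent identity from a Mash distance.

    Assumes identity ~= 100 * (1 - distance); this is a rough
    approximation, not a pggb or mash API.
    """
    if not 0.0 <= distance <= 1.0:
        raise ValueError("Mash distance must be between 0 and 1")
    return 100.0 * (1.0 - distance)


# Third column of the `mash dist` output quoted in the question.
identity = mash_distance_to_identity(0.0181313)
print(f"approximate identity: {identity:.2f}%")
```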

You might want to set -n 1. I'm not certain that, for two genomes (or other small numbers), it should be == -H. We tend to "oversaturate" the mapping slightly when working with larger numbers of genomes (e.g. setting number of mappings == number of genomes), but here it might be best to set -n 1 -H 2. Frankly, I'm thinking that might make sense as best practice going forward, but I haven't yet tested it.

I'll try to work your feedback into documentation about how to go from mash dist to settings. Really, we want to automate this, and it's probably possible to do so from the mash distance. The segment length determination is somewhat arbitrary, because it depends on the lengths of homologies that you want to support as approximately linear in the graph.

-G 20000 might cause you to run out of memory. abPOA (at least as we're running it) has some quadratic memory costs in the length of the segment. For HPRC and work on mouse (20-90 haplotypes) we use -G 13117,13219 on a system with 386GB of RAM and tend to use around half of memory (~150G) at peak. The successive numbers indicate two smoothxg passes with different target abPOA lengths. In practice, these passes really help normalize the graph well.
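
To get a feel for why -G 20000 could be noticeably more expensive than the values used for HPRC, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the relevant memory cost scales exactly with the square of the target length; the thread only says abPOA has "some quadratic memory costs", so treat the ratio as a rough indication and not a prediction. The function name is hypothetical.

```python
def relative_quadratic_cost(new_length: int, reference_length: int) -> float:
    """Ratio of memory cost under a purely quadratic model (an assumption)."""
    if new_length <= 0 or reference_length <= 0:
        raise ValueError("lengths must be positive")
    return (new_length / reference_length) ** 2


# Compare -G 20000 against each of the two pass lengths in -G 13117,13219.
for reference in (13117, 13219):
    ratio = relative_quadratic_cost(20000, reference)
    print(f"-G 20000 vs {reference}: about {ratio:.2f}x")
```

Under this simplified model the ratio comes out a bit over 2x for both pass lengths.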

I would expect this process to take a day or so on one system, possibly less given that you just have two genomes. For human and mouse I've been partitioning the contigs by chromosome, but it should be fine to directly build the graph from everything.
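
Partitioning by chromosome could be sketched as below. This is only an illustration: it assumes every sequence name ends in a chromosome label after a '#' separator (for example the hypothetical names A#Chr1 and B#Chr1), which is not something this thread specifies, and the function name is made up. It groups names so that each group could be written to its own FASTA.

```python
from collections import defaultdict


def group_by_chromosome(names: list[str], sep: str = "#") -> dict[str, list[str]]:
    """Group sequence names by the label after the last separator.

    Assumes names look like '<genome><sep><chromosome>'; this naming
    scheme is hypothetical and may not match your assemblies.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        chromosome = name.rsplit(sep, 1)[-1]
        groups[chromosome].append(name)
    return dict(groups)


# Hypothetical names, e.g. as read from the first column of input.fasta.fai.
names = ["A#Chr1", "A#Chr2", "B#Chr1", "B#Chr2"]
for chromosome, members in group_by_chromosome(names).items():
    print(chromosome, members)
```

Each group's names could then be handed to samtools faidx (for example `samtools faidx input.fasta A#Chr1 B#Chr1 > Chr1.fasta`) to extract the matching sequences, provided the names in the combined FASTA are unique.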

Let me know how it goes and if you need any more hints. I'll try to roll your perspective into an update to the pggb documentation.

Aannaw commented 2 years ago

Hello Professor, thanks for your reply! I have just run the command you recommended:

pggb -i input.fasta -s 100000 -p 95 -t 40 -v -L -U -o out -T 20 -n 1 -H 2 -G 13117,13219

I am still confused about partitioning the contigs by chromosome. My two initial genomes are scaffolded into several pseudo-chromosomes and some unlocalized contigs. After combining the two genomes, there is a pair of Chr1, Chr2, Chr3, ... in the combined input.fasta. If I choose to run pggb by chromosome, should I put the pair of Chr1 sequences into a Chr1.fasta? And what about the unlocalized contigs?

I also have another question. When I run the command above, the first step, wfmash, seems to take a long time. The temporary file it generates, wfmash-KnfItp, is empty. When I checked with htop, the process state was sleeping. Is it out of memory? Best wishes!

ekg commented 2 years ago

I'm not sure what would cause a stall at that stage. That's very strange. What does the output log say it's doing?

It is strongly recommended that you run pggb in a directory on a fast disk, ideally an SSD. A slow disk can cause apparent stalls.

Did you run out of memory during smoothxg? I'm curious why you set -T 20 to reduce the parallelism of that step.

AndreaGuarracino commented 2 years ago

Hi @Aannaw, can you share your input (input.fasta) and specify which version (which commit) of pggb you are using?

Aannaw commented 2 years ago

Hello Professors, I am sorry for the delayed reply. It seems to work: checking with htop, it is now running the command

smoothxg -t 40 -T 20 -g out/input.fasta.15ccfd3.2ff309f.seqwish.gfa -w 26234 -K -X 100 -I 0.95 -R 0 -j 0 -e 0 -l 13117 -p 1,19,39,3,81,1,1 -o 0.03 -Y 200 -d 0 -D 0 -V -o out/input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa

The file input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa has not been generated yet. I initially set -T 20 because I saw in the README that smoothxg consumes a huge amount of memory in the POA step, so I set -T to reduce the number of threads.

@AndreaGuarracino Hello Professor, I installed noarch/pggb-0.2.0-hdfd78af_0.tar.bz2 with conda. Because input.fasta is large, I will send it to you by mail if needed. Thanks again for your help. Best wishes!

Aannaw commented 2 years ago

Hello Professor, the final .smooth.gfa graph was generated, and the 1D and 2D visualizations have also been generated.

It seems that there are mappings between pseudo-chromosomes or scaffolds of the same assembly. Is that true? Is there any explanation or assessment of the output graph (.smooth.gfa)? Or should I run pggb independently on each pair of matching pseudo-chromosomes or scaffolds from the two genomes? Looking forward to your reply. Best wishes!

[Attached images: "Untitled1640857306"; 1D visualization "Ma6-Mp5 all fata 15ccfd3 2ff309f 6754527 smooth og viz_inv"; 2D layout "Ma6-Mp5 all fata 15ccfd3 2ff309f 6754527 smooth og lay draw"]