Building a better pangenome

GeorgeBGM commented 1 year ago

Hi, how can I add new sample genomes and contigs to an existing pangenome produced by PGGB? Can it be done directly with Minigraph or GraphAligner? Any suggestions would be appreciated.

subwaystation commented 1 year ago

Hi @George-du,

there are several possibilities:

1. Rebuild the whole graph with the new sample genomes and contigs added.
2. You might be able to map against the graph using GraphAligner, but only if you masked all complex and repetitive regions in your input sequences. Else it won't scale.
3. minigraph is not an option here, because it is reference-based and only accepts rGFA.

I would recommend the first option, though I am aware of the computational overhead.
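
As a rough sketch of the first option (rebuilding), the new assemblies are appended to the original input FASTA and the haplotype count passed to `-n` is recomputed. The file names and the output directory are hypothetical, and the helper assumes that sequence names follow PanSN naming (`sample#haplotype#contig`); the `pggb` flags are the ones that appear elsewhere in this thread. The command is only printed, not run.

```shell
#!/bin/sh
# Count distinct haplotypes in a gzipped FASTA whose headers are ASSUMED
# to follow PanSN naming: >sample#haplotype#contig
count_haplotypes() {
    gzip -dc "$1" | grep '^>' | cut -d'#' -f1,2 | sort -u | wc -l | tr -d ' '
}

# Print (do not run) a pggb command for the combined FASTA.
# "rebuilt.pan" is a hypothetical output directory.
print_rebuild_command() {
    combined=$1
    n=$(count_haplotypes "$combined")
    echo "pggb -i $combined -o rebuilt.pan -t 30 -p 98 -s 50000 -n $n -k 311"
}

# Hypothetical usage. Concatenated gzip streams form a valid gzip file:
#   cat old.fa.gz new.fa.gz > combined.fa.gz
#   print_rebuild_command combined.fa.gz
```

The combined file would presumably need to be re-indexed (for example with `samtools faidx`) before the run, since the index of the old file no longer matches.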

GeorgeBGM commented 1 year ago

Thank you for your reply. I will take your suggestion. A PGGB feature for adding new samples to an existing graph would be very helpful.

ekg commented 1 year ago

A future option would be to generate only the alignments induced by the addition of the new samples. This would help because most of the runtime depends on the quadratic all-to-all alignment.
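
To make the scaling argument concrete, here is a back-of-the-envelope sketch. It counts unordered sequence pairs only, which is a simplifying assumption for illustration: the real cost also depends on sequence length, divergence, and how the mapper sparsifies its work. The function names are hypothetical.

```shell
#!/bin/sh
# Pairs compared when building from scratch with n sequences: n*(n-1)/2
all_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# Pairs that involve at least one of k new sequences added to n existing ones:
# k*n (new vs. existing) + k*(k-1)/2 (new vs. new)
induced_pairs() {
    echo $(( $2 * $1 + $2 * ($2 - 1) / 2 ))
}

# Example: 236 existing haplotypes (the -n value used in this thread) plus 4 new ones
echo "full rebuild: $(all_pairs 240) pairs"
echo "incremental:  $(induced_pairs 236 4) pairs"
```

Under this simplified count, a full rebuild of 240 sequences touches 28680 pairs, while only 950 of them involve a new sequence.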

GeorgeBGM commented 1 year ago

Sounds great, thanks so much for your help.

GeorgeBGM commented 1 year ago

Hi, I am using PGGB to build a pangenome chromosome by chromosome, and smoothxg is failing on some of the chromosomes. Do you have any suggestions about these errors? Thanks!

Software: Smoothxg (v0.6.8-0-ga8a0e9e)

Error 1:
smoothxg -t 30 -T 30 -g ./graphs/chrY.pan/chrY.pan.new.fa.gz.bf8016f.04f1c29.seqwish.gfa -r 114 --base ./graphs/chrY.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 11400 -d 0 -D 0 -S -Q Consensus_ -V -o ./graphs/chrY.pan/chrY.pan.new.fa.gz.bf8016f.04f1c29.5ef21f9.smooth.gfa
259730.80s user 26577.66s system 82% cpu 345179.20s total 53544752Kb max memory

Error 2:
smoothxg -t 30 -T 30 -g ./graphs/chr9.pan/chr9.pan.new.fa.gz.2ca993e.04f1c29.seqwish.gfa -r 236 --base ./graphs/chr9.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 23600 -d 0 -D 0 -S -Q Consensus_ -V -o ./graphs/chr9.pan/chr9.pan.new.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa

GeorgeBGM commented 1 year ago

Hi, did I describe the problem clearly? Are there any suggestions for solving it?

GeorgeBGM commented 1 year ago

Hi, developers! What should I do to avoid the error reported above? Should I re-run smoothxg without the -Q Consensus_ parameter and with -O 0, or do I need to reduce my mash length from 50kb to 10kb (reference: https://github.com/pangenome/pggb/issues/182)? Are there any other suggestions? Also, would either of these two strategies have a significant impact on the final result?

GeorgeBGM commented 11 months ago

Hi, @subwaystation @ekg ,

I tried the above strategies on human chromosome 13, but the smoothxg step still fails. Are there any suggestions about this problem, or can I just use the results from before the smoothxg step?

1) Re-run smoothxg without the -Q Consensus_ parameter and with -O 0:

The command:
smoothxg -t 30 -T 30 -g ./graphs/chr13.pan/chr13*seqwish.gfa -r 236 --base ./graphs/chr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0 -Y 23 -D 0 -o ./g13.pan/chr13.pan/chr13.pan.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa

The error message:
[smoothxg::(1-3)::smooth_and_lace] embedding 79826 path fragments: 0.01% @ 2.81e+04/s elapsed: 00:00:00:00 remain: 00:00:00:02
smoothxg: /opt/conda/conda-bld/smoothxg_1671059618733/work/src/smooth.cpp:2117: odgi::graph_t smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, bool, uint64_t, const string&): Assertion `false' failed.

2) Reduce the mash length from 50kb to 10kb:

The command:
$RUN_PGGB -r -i /home/u20111010010/Project/Pan-genome/002.Merge_Pan_V2/Merge-V1/001.Sequence_partitioning/parts/chr$i.pan.new.fa.gz -o ./graphs/new_chr$i.pan -t 30 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 30

The error message:
[smoothxg::(1-3)::break_and_split_blocks] cutting and splitting 869849 blocks: 100.00% @ 4.33e+04/s elapsed: 00:00:00:20 remain: 00:00:00:00
smoothxg: /opt/conda/conda-bld/smoothxg_1671059618733/work/build/sdsl-lite-prefix/src/sdsl-lite-build/include/sdsl/enc_vector.hpp:193: sdsl::enc_vector<t_coder, t_dens, t_width>::value_type sdsl::enc_vector<t_coder, t_dens, t_width>::operator[](sdsl::enc_vector<t_coder, t_dens, t_width>::size_type) const [with t_coder = sdsl::coder::elias_delta; unsigned int t_dens = 128; unsigned char t_width = 0; sdsl::enc_vector<t_coder, t_dens, t_width>::value_type = long unsigned int; sdsl::enc_vector<t_coder, t_dens, t_width>::size_type = long unsigned int]: Assertion `i < m_size' failed.
Command terminated by signal 6

I'm looking forward to your reply. Best, Du

AndreaGuarracino commented 11 months ago

Can you try the same command lines, but installing PGGB via Docker/Singularity?

GeorgeBGM commented 11 months ago

Hi, developers!

I will try to install PGGB via Docker/Singularity. Do I need to install a specific version?

AndreaGuarracino commented 11 months ago

The latest version available, thanks!

GeorgeBGM commented 11 months ago

Got that. I'll try it again.

GeorgeBGM commented 11 months ago

Hi, @subwaystation @ekg @AndreaGuarracino,

I installed the latest PGGB (pggb 8eaf354) using Singularity with non-root privileges, but still get a similar error. The details of the reported error are as follows:

1. Re-run smoothxg without the -Q Consensus_ parameter and with -O 0 (mash length: 50kb/10kb):

The command:

10kb:
RUN_PGGB="singularity exec /home/Software/pggb/pggb.simg pggb"
$RUN_PGGB -r -i chr13.pan.new.fa.gz -o new_chr13.pan -t 45 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 45

50kb:
singularity exec /home/Software/pggb/pggb.simg smoothxg -t 30 -T 30 -g chr13*seqwish.gfa -r 236 --base ./graphs/chr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0 -Y 23 -D 0 -o ./graphs/chr13.pan/chr13.pan.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa

The error message:

10kb:
e+04 bp/s elapsed: 00:00:00:14 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 395114099 path fragments: 0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 395114099 path fragments: 0.00% @ 2.25e+04 bp/s elapsed: 00:00:00:00 remain: 00:04:52:01
smoothxg: /smoothxg/src/smooth.cpp:2551: odgi::graph_t smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, uint64_t, const string&): Assertion `false' failed.
Command terminated by signal 6
smoothxg -t 45 -T 45 -g ./graphs/new_chr13.pan/chr13.pan.new.fa.gz.402d19f.04f1c29.seqwish.gfa -r 236 --base ./graphs/newchr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 23600 -d 0 -D 0 -Q Consensus -V -o ./graphs/new_chr13.pan/chr13.pan.new.fa.gz.402d19f.04f1c29.03ca4fb.smooth.gfa
1732131.33s user 1389324.41s system 2502% cpu 124738.32s total 286104940Kb max memory
INFO: Cleaning up image...

50kb:
02 remain: 00:00:00:04
[smoothxg::(1-3)::smooth_and_lace] adding edges from 992731 graphs: 100.00% @ 3.97e+05 bp/s elapsed: 00:00:00:02 remain: 00:00:00:00
[smoothxg::(1-3)::smooth_and_lace] embedding 76537735 path fragments: 0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::(1-3)::smooth_and_lace] embedding 76537735 path fragments: 0.00% @ 1.97e+04 bp/s elapsed: 00:00:00:00 remain: 00:01:04:52
smoothxg: /smoothxg/src/smooth.cpp:2551: odgi::graph_t smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, uint64_t, const string&): Assertion `false' failed.
INFO: Cleaning up image...

2. Re-run the PGGB pipeline using Singularity:

The command:

RUN_PGGB="singularity exec /home/Software/pggb/pggb.simg pggb"
$RUN_PGGB -r -i chr13.pan.new.fa.gz -o ./graphs/rerun-new_chr13.pan -t 45 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 45

The error message:

[wfmash::skch::Map::mapQuery] count of mapped reads = 13369, reads qualified for mapping = 13641, total input reads = 13641, total input bp = 24623027601
[wfmash::map] time spent mapping the query: 3.71e+03 sec
[wfmash::map] mapping results saved in: /dev/stdout
wfmash -s 10000 -l 50000 -p 98 -n 235 -k 19 -H 0.001 -X -t 45 --tmp-base ./graphs/rerun-new_chr13.pan chr13.pan.new.fa.gz --approx-map
126560.51s user 7903.55s system 3462% cpu 3883.39s total 20414144Kb max memory
/usr/local/bin/pggb: line 497: /dev/fd/63: No such file or directory
INFO: Cleaning up image...

Do you have any suggestions for these reported errors? Thanks!

I'm looking forward to your reply. Best, Du
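
A side note on the `/dev/fd/63` message: paths of that form are what bash uses for process substitution (`<(...)`). One possible reading, offered as an assumption rather than a diagnosis, is that `/dev/fd` is not usable inside the container as it was launched. A minimal check, with a hypothetical function name, could look like this:

```shell
#!/bin/sh
# Report whether bash process substitution works in the current environment.
# Prints "ok" if the output of <(...) can be read back, "failed" otherwise.
check_process_substitution() {
    if out=$(bash -c 'cat <(echo ok)' 2>/dev/null) && [ "$out" = "ok" ]; then
        echo "ok"
    else
        echo "failed"
    fi
}

check_process_substitution

# The same probe inside the container (image path as used in this thread):
#   singularity exec /home/Software/pggb/pggb.simg bash -c 'cat <(echo ok)'
```

If the probe fails inside the container but works on the host, the problem would lie in how the container sees `/dev`, not in wfmash itself.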

ekg commented 11 months ago

It looks like two different issues.

If you re-run, do you ever get the exact same error in smooth_and_lace?

GeorgeBGM commented 11 months ago

@ekg @subwaystation @AndreaGuarracino

Hi, developers!

The second attempt ran the PGGB pipeline completely from scratch using the Singularity image (non-root install). It fails right after the wfmash step, so it never reaches the smoothxg step.

The first attempt started from the output of the Linux installation, where the smoothxg step had failed; I then re-ran only that step using the smoothxg binary from the Singularity image.

They really are two different issues. Thanks in advance!

subwaystation commented 11 months ago

Hi @George-du, would it be possible to share your input data or a tiny subset of it, which produces the issues? Thanks!

GeorgeBGM commented 11 months ago

Dear @subwaystation @AndreaGuarracino,

Here is the raw data I used for the above pipeline, please help me check the exact errors. Thanks! (https://sandbox.zenodo.org/record/1234413)

GeorgeBGM commented 11 months ago

Dear @subwaystation @AndreaGuarracino,

Can the data be downloaded and used properly?

AndreaGuarracino commented 10 months ago

@George-du, thank you for the data. I am running pggb with it on our cluster, installed by building each tool from GitHub source (so no Docker/Singularity).

pggb -i chr13.pan.new.fa.gz -p 98 -s 50000 -n 236 -k 311 -t 48 -o xxx -D /scratch

It is taking a while. At the moment it is at the 2nd round of SPOA, without issues.

GeorgeBGM commented 10 months ago

Dear @AndreaGuarracino @subwaystation,

Wow, that sounds good. The software versions I'm using in the PGGB pipeline are listed below. Additionally, I found that the smoothxg step took an extraordinarily long time on some chromosomes and ended with an error (chr15; ~1 month; Command terminated by signal 7). The details are as follows:

The software versions of the PGGB pipeline:
Wfmash: v0.10.3-3-g8ba3c53
Seqwish: v0.7.9-0-gd9e7ab5
Smoothxg: v0.6.8-0-ga8a0e9e
Odgi: v0.8.2-0-g8715c55

The command and result (chr15; ~1 month; Command terminated by signal 7):
RUN_PGGB="/home/Software/Anaconda/mambaforge-pypy3/envs/pggb/bin/pggb"
sbatch -p tissue --job-name=chr15 --mem=300G -c 30 -o ./log/001.test-pggb-graph-chr15.out --wrap "$RUN_PGGB -r -i /home/Project/Pan-genome/002.Merge_Pan_V2/Merge-V1/001.Sequence_partitioning/parts/chr15.pan.new.fa.gz -o ./graphs/chr15.pan -t 30 -p 98 -s 50000 -n 236 -k 311 -O 0.03 -T 30"

Looking forward to the resolution of this issue. Thanks in advance.

subwaystation commented 10 months ago

Is there a possibility for you @George-du to run our latest Docker image? You have quite a lot of data as input ^^ Maybe you ran out of disk space?
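
A quick way to test the disk-space hypothesis is to look at the free space on the output and temporary directories before and during a run. The sketch below relies only on POSIX `df`; the function names, the directory, and the threshold are placeholders.

```shell
#!/bin/sh
# Print the free space, in KiB, on the filesystem holding the given directory.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Warn (and return non-zero) when less than min_kib is free on dir.
check_free_space() {
    dir=$1
    min_kib=$2
    avail=$(free_kib "$dir")
    if [ "$avail" -lt "$min_kib" ]; then
        echo "WARNING: only ${avail} KiB free on ${dir}"
        return 1
    fi
    echo "OK: ${avail} KiB free on ${dir}"
}

# Example with placeholder values: require 1 GiB free in the current directory
check_free_space . 1048576 || true
```

The same check can be pointed at the temporary directory as well, for example the one passed to `pggb -D` in an earlier comment.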

GeorgeBGM commented 10 months ago

Dear @subwaystation,

I will contact the administrator and try to run the latest Docker image. Thanks.

AndreaGuarracino commented 10 months ago

@George-du, I was able to finish PGGB. It seems the problem is specific to your installation and/or cluster.

I've used

general:
  input-fasta:        /lizardfs/guarracino/bug_smoothxg/chr13.pan.new.fa.gz
  output-dir:         /lizardfs/guarracino/bug_smoothxg/xxx
  temp-dir:           /scratch
  resume:             false
  compress:           false
  threads:            48
  poa_threads:        48
wfmash:
  version:            v0.10.4-7-g0981b92
  segment-length:     50000
  block-length:       250000
  map-pct-id:         98
  n-mappings:         236
  no-splits:          false
  sparse-map:         false
  mash-kmer:          19
  mash-kmer-thres:    0.001
  exclude-delim:      false
  no-merge-segments:  false
seqwish:
  version:            v0.7.9-2-gf44b402
  min-match-len:      311
  sparse-factor:      0
  transclose-batch:   10000000
smoothxg:
  version:            v0.7.0-18-g4ff4cf2
  skip-normalization: false
  n-haps:             236
  path-jump-max:      0
  edge-jump-max:      0
  poa-length-target:  700,900,1100
  poa-params:         1,19,39,3,81,1
  poa_padding:        0.001
  run_abpoa:          false
  run_global_poa:     false
  pad-max-depth:      100
  write-maf:          false
  consensus-spec:     false
  consensus-prefix:   Consensus_
  block-id-min:       .9800
  block-ratio-min:    0
odgi:
  version:            v0.8.3-26-gbc7742ed
  viz:                true
  layout:             true
  stats:              false
gfaffix:
  version:            v0.1.5
  reduce-redundancy:  true
vg:
  version:            v1.50.1
  deconstruct:        false

GeorgeBGM commented 10 months ago

Wow, I'll reinstall the latest version of PGGB and test it out.
