pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

PGGB discarding SNPs and Small InDels #215

Closed by VLoegler 2 years ago

VLoegler commented 2 years ago

Is there a way to leave SNPs and small indels out of the final graph? I am looking for a way to get a pangenome graph that contains only structural variants and translocations. Thanks!

ekg commented 2 years ago

There is a currently hidden option to make a consensus graph. It's enabled with -C.

It's hidden because it sometimes breaks and we haven't had time to debug it.

You can use it by following the hidden help text in the pggb script.

Something like pggb -C cons,100 should give you what you're interested in. The consensus graph will be written to the output directory as an additional graph.

    #echo "    -C, --consensus-spec SPEC   consensus graph specification: write consensus graphs to"                         
    #echo "                                BASENAME.cons_[spec].gfa; where each spec contains at least a min_len parameter"  
    #echo "                                (which defines the length of divergences from consensus paths to preserve in the" 
    #echo "                                output), optionally a file containing reference paths to preserve in the output," 
    #echo "                                a flag (y/n) indicating whether we should also use the POA consensus paths, a"    
    #echo "                                minimum coverage of consensus paths to retain (min_cov), and a maximum allele"    
    #echo "                                length (max_len, defaults to 1e6); implies -a; example:"                          
    #echo "                                cons,100,1000:refs1.txt:n,1000:refs2.txt:y:2.3:1000000,10000"                     
    #echo "                                [default: off]"                                                                   
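
To make the spec format in the help text easier to read, here is a minimal sketch that splits a spec string into its fields. The function name `describe_consensus_spec` is hypothetical and not part of pggb. It assumes that the first comma-separated item is an output prefix and that each later item follows the order listed in the help text (min_len, reference paths file, POA flag, min_cov, max_len); pggb's own parser may differ.

```shell
# Hypothetical helper, not part of pggb: print the fields of a -C spec string.
# Assumed layout: prefix,min_len[:refs_file[:use_poa[:min_cov[:max_len]]]],...
describe_consensus_spec() (
    spec=$1
    echo "prefix=${spec%%,*}"
    case $spec in
        *,*) rest=${spec#*,} ;;   # everything after the prefix
        *)   exit 0 ;;            # prefix only, nothing else to print
    esac
    set -f                        # no filename globbing while splitting
    IFS=','
    for item in $rest; do
        # Split one item on ':'; missing trailing fields stay empty.
        IFS=':' read -r min_len refs use_poa min_cov max_len <<EOF
$item
EOF
        # "-" marks an omitted field; max_len falls back to the documented 1e6.
        echo "min_len=$min_len refs=${refs:--} use_poa=${use_poa:--} min_cov=${min_cov:--} max_len=${max_len:-1e6}"
    done
)
```

Calling `describe_consensus_spec "cons,100"` prints the prefix and a single item with only min_len set, which is the shape of the `cons,100` spec suggested above.
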
ekg commented 2 years ago

Sorry, this is hard-disabled. You'll need to apply this diff to use it.

diff --git a/pggb b/pggb
index a872aef..40c0050 100755
--- a/pggb
+++ b/pggb
@@ -52,7 +52,7 @@ fi

 # read the options
 cmd=$0" "$@
-TEMP=`getopt -o i:o:D:a:p:n:s:l:K:F:k:x:f:B:H:j:P:O:Me:t:T:vhASY:G:Q:d:I:R:NbrmZzV: --long input-fasta:,output-dir:,temp-dir:,input-paf:,map-pct-id:,n-mappings:,segment-length:,block-length-min:,mash-kmer:,mash-kmer-thres:,min-match-length:,sparse-map:,sparse-factor:,transclose-batch:,n-haps:,path-jump-max:,subpath-min:,edge-jump-max:,threads:,poa-threads:,skip-viz,do-layout,help,no-merge-segments,do-stats,exclude-delim:,poa-length-target:,poa-params:,poa-padding:,run-abpoa,global-poa,write-maf,consensus-spec:,consensus-prefix:,pad-max-depth:,block-id-min:,block-ratio-min:,no-splits,resume,keep-temp-files,multiqc,compress,vcf-spec: -n 'pggb' -- "$@"`
+TEMP=`getopt -o i:o:D:a:p:n:s:l:K:F:k:x:f:B:H:j:P:O:Me:t:T:vhASY:G:Q:C:d:I:R:NbrmZzV: --long input-fasta:,output-dir:,temp-dir:,input-paf:,map-pct-id:,n-mappings:,segment-length:,block-length-min:,mash-kmer:,mash-kmer-thres:,min-match-length:,sparse-map:,sparse-factor:,transclose-batch:,n-haps:,path-jump-max:,subpath-min:,edge-jump-max:,threads:,poa-threads:,skip-viz,do-layout,help,no-merge-segments,do-stats,exclude-delim:,poa-length-target:,poa-params:,poa-padding:,run-abpoa,global-poa,write-maf,consensus-spec:,consensus-prefix:,pad-max-depth:,block-id-min:,block-ratio-min:,no-splits,resume,keep-temp-files,multiqc,compress,vcf-spec: -n 'pggb' -- "$@"`
 eval set -- "$TEMP"

 # extract options and their arguments into variables.
@@ -84,7 +84,7 @@ while true ; do
         -b|--run-abpoa) run_abpoa=true ; shift ;;
         -z|--global-poa) run_global_poa=true ; shift ;;
         -M|--write-maf) write_maf=true ; shift ;;
-        #-C|--consensus-spec) consensus_spec=$2 ; shift 2 ;;
+        -C|--consensus-spec) consensus_spec=$2 ; shift 2 ;;
         -Q|--consensus-prefix) consensus_prefix=$2 ; shift 2 ;;
         -t|--threads) threads=$2 ; shift 2 ;;
         -T|--poa-threads) poa_threads=$2 ; shift 2 ;;
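
If saving the diff to a file and applying it with `git apply` in a pggb checkout is inconvenient, the same two changes can be sketched with `sed`. The function name `enable_consensus_flag` is hypothetical, and the patterns assume the script text matches the revision shown in the diff (the `G:Q:d:` run in the getopt string and the commented-out `-C` case); on other revisions it may change nothing.

```shell
# Hypothetical helper, not part of pggb: re-enable -C in a copy of the script.
# Mirrors the two hunks of the diff above; leaves the file alone if sed fails.
enable_consensus_flag() {
    script=$1
    tmp=$(mktemp) || return 1
    # 1) add "C:" to the short option string passed to getopt
    # 2) uncomment the -C|--consensus-spec case, keeping its indentation
    sed -e '/getopt -o/ s/G:Q:d:/G:Q:C:d:/' \
        -e 's/^\([[:space:]]*\)#\(-C|--consensus-spec)\)/\1\2/' \
        "$script" > "$tmp" && cat "$tmp" > "$script"   # cat keeps the executable bit
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Running `enable_consensus_flag ./pggb` a second time is harmless, since neither pattern matches once the change is in place.
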
ekg commented 2 years ago

I caution that this will still leave nodes that are as long as the POA target length, and there can be SNPs between them. A method that works directly on the PGGB graph output (generic GFA) would seem to be better. It would be amazing if someone developed a generic method to do this.
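
One rough way to see how many short nodes remain in a consensus graph is to count GFA segments below a length threshold. This is only an inspection sketch, not the generic filtering method asked for above. The function name `count_short_segments` is hypothetical; it assumes GFA 1 `S` lines with the sequence in the third tab-separated column, or `*` plus an `LN:i:` tag.

```shell
# Hypothetical helper: report how many GFA segments are shorter than min_len.
count_short_segments() {
    gfa=$1
    min_len=$2
    awk -F '\t' -v min_len="$min_len" '
        $1 == "S" {
            len = length($3)
            if ($3 == "*") {              # sequence omitted: use the LN tag if present
                len = 0
                for (i = 4; i <= NF; i++)
                    if ($i ~ /^LN:i:/) len = substr($i, 6) + 0
            }
            total++
            if (len < min_len) n_short++
        }
        END { printf "%d of %d segments shorter than %d bp\n", n_short, total, min_len }
    ' "$gfa"
}
```

For example, `count_short_segments out.cons.gfa 100` (the file name is a placeholder) prints a one-line summary for a 100 bp threshold.
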

VLoegler commented 2 years ago

Thanks for the answer!