pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

Building a graph from fragmented assemblies #268

Open evcurran opened 1 year ago

evcurran commented 1 year ago

I have been trying to build a pangenome graph of Arabidopsis arenosa (an outcrosser with high heterozygosity) from 10 unscaffolded assemblies (and 10 corresponding alternative haplotype assemblies) and 4 chromosome-level assemblies. Three of the chromosome-level assemblies are from closely related species, and one is a recent A. arenosa build. As the unscaffolded assemblies are quite fragmented (minimum contig size of 4 kb), I used the sequence partitioning method to cluster the contigs into chromosome communities and then built graphs per chromosome, using the following settings:

pggb -i $fasta -s 2000 -p 90 -k 29 -G 3079,3559 -n 24 -t 12 -v -L -U -S -m -o $outdir

and the output was very messy (for a single chromosome):

[Image: scaffold_5_arenosa_pri_alt.fa.3dd2fe6.2ff309f.57e755b.smooth.og.lay.draw_mqc]

length: 85,478,416 (largest constituent chromosome is 24,241,940) nodes: 5,153,596 edges: 7,275,076 paths: 3891

To try to simplify the graph, I used RagTag to scaffold the fragmented assemblies against the A. arenosa chromosome-level assembly, and then built a graph from the 10 "pseudo-scaffolded" primary (+ 10 alternative) assemblies plus the A. arenosa reference. Here are a couple of different settings I tried for a single chromosome:

pggb -i $fasta -s 20000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.66905ed.7bdde5a.c6d8610.smooth.og.viz_depth]

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.66905ed.7bdde5a.c6d8610.smooth.og.lay.draw_mqc]

length: 122,223,409 (longest constituent chromosome is 26,723,338) nodes: 2,598,044 edges: 3,605,318 paths: 21

pggb -i $fasta -s 10000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.5f35582.7bdde5a.c6d8610.smooth.og.viz_depth]

[Image: scaffold_1_arenosa_pri_alt_noRefs.fa.5f35582.7bdde5a.c6d8610.smooth.og.lay.draw_mqc]

length: 88,605,057 (longest constituent chromosome is 26,723,338) nodes: 14,316,648 edges: 20,498,846 paths: 21
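As a rough way to compare the three runs, the reported numbers can be reduced to two ratios: graph length divided by the longest constituent chromosome (how far the graph is inflated beyond a single linear chromosome) and graph length divided by node count (mean node length). This is only a minimal sketch using the figures quoted above; `graph_summary` is a hypothetical helper name, not part of pggb or odgi, and the two ratios are my own ad hoc summary rather than an established quality metric.

```shell
#!/usr/bin/env bash
# graph_summary is a hypothetical helper (not a pggb/odgi command).
# Arguments: graph length (bp), longest chromosome length (bp), node count.
# Prints: <graph length / longest chromosome> <mean node length in bp>
graph_summary() {
  awk -v len="$1" -v chrom="$2" -v nodes="$3" \
    'BEGIN { printf "%.2f %.2f\n", len / chrom, len / nodes }'
}

graph_summary 85478416 24241940 5153596    # unscaffolded contigs, -s 2000
graph_summary 122223409 26723338 2598044   # RagTag-scaffolded, -s 20000
graph_summary 88605057 26723338 14316648   # RagTag-scaffolded, -s 10000
```

All three graphs come out several times longer than the longest chromosome, and the -s 10000 run has much shorter nodes on average than the -s 20000 run, which matches the more fragmented look of its layout.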

The linearity has improved, but there are still some very complex-looking regions that might not be aligned properly. Do you have any recommendations for the parameters I should be using? For context, I want to capture structural variation among the lineages represented by the 10 assemblies, and then align short reads to the graph so that I can genotype SVs in existing sequencing data. I saw that there is a parameter, -Y, to avoid self-mappings, which could reduce complexity, but it's unclear to me what argument needs to be passed to it. Thank you for any help with this!