I have been trying to build a pangenome graph of Arabidopsis arenosa (an outcrosser with high heterozygosity) from 10 unscaffolded assemblies (plus the 10 corresponding alternative-haplotype assemblies) and 4 chromosome-level assemblies. Three of the chromosome-level assemblies are from closely related species, and one is a recent A. arenosa build. Because the unscaffolded assemblies are quite fragmented (minimum contig size of 4 kb), I used the sequence partitioning method to cluster the contigs into chromosome communities and then built a graph per chromosome, using the following settings:
pggb -i $fasta -s 2000 -p 90 -k 29 -G 3079,3559 -n 24 -t 12 -v -L -U -S -m -o $outdir
and the output was very messy (for a single chromosome):
length: 85,478,416 (largest constituent chromosome is 24,241,940) nodes: 5,153,596 edges: 7,275,076 paths: 3891
To try to simplify the graph, I used RagTag to scaffold the fragmented assemblies against the A. arenosa chromosome-level assembly, and then built a graph from the 10 "pseudo-scaffolded" primary (+ 10 alternative) assemblies plus the A. arenosa reference. Here are a couple of different settings I tried for a single chromosome:
pggb -i $fasta -s 20000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir
length: 122,223,409 (longest constituent chromosome is 26,723,338) nodes: 2,598,044 edges: 3,605,318 paths: 21
pggb -i $fasta -s 10000 -p 90 -k 47 -G 3079,3559 -n 21 -P 1,4,6,2,26,1 -t 12 -v -S -L -o $outdir
length: 88,605,057 (longest constituent chromosome is 26,723,338) nodes: 14,316,648 edges: 20,498,846 paths: 21
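To put the three runs side by side, here is a rough Python sketch that derives a few ratios from the statistics reported above. The metrics (graph length relative to the longest constituent chromosome, mean node length, edges per node) and the run labels are my own ad hoc choices for comparison, not values that pggb reports.

```python
# Graph statistics copied from the three pggb runs reported above.
# The run labels are hypothetical names chosen for this comparison.
runs = {
    "partitioned, -s 2000": {
        "length": 85_478_416, "longest": 24_241_940,
        "nodes": 5_153_596, "edges": 7_275_076, "paths": 3891,
    },
    "ragtag, -s 20000": {
        "length": 122_223_409, "longest": 26_723_338,
        "nodes": 2_598_044, "edges": 3_605_318, "paths": 21,
    },
    "ragtag, -s 10000": {
        "length": 88_605_057, "longest": 26_723_338,
        "nodes": 14_316_648, "edges": 20_498_846, "paths": 21,
    },
}


def summarize(stats):
    """Return (length inflation, mean node length in bp, edges per node)."""
    inflation = stats["length"] / stats["longest"]      # graph length vs. longest chromosome
    mean_node_bp = stats["length"] / stats["nodes"]     # average bases per node
    edges_per_node = stats["edges"] / stats["nodes"]    # rough connectivity measure
    return inflation, mean_node_bp, edges_per_node


for label, stats in runs.items():
    inflation, mean_node_bp, edges_per_node = summarize(stats)
    print(f"{label}: inflation={inflation:.2f}x, "
          f"mean node={mean_node_bp:.1f} bp, edges/node={edges_per_node:.2f}")
```

In all three runs the graph is more than three times as long as the longest constituent chromosome, which is part of why I suspect that a lot of sequence is not being aligned.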
The linearity has improved, but there are still some very complex-looking regions that might not be aligned properly. Do you have any recommendations for the parameters I should be using? For context, I want to capture structural variation among the lineages represented by the 10 assemblies, and then align short reads to the graph so that I can genotype SVs in existing sequencing data. I saw that there is a parameter, -Y, to avoid self-mappings, which could reduce complexity, but it’s unclear to me what argument needs to be passed to it. Thank you for any help with this!