pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
194 stars 40 forks source link

Approach combining "odgi depth / odgi extract" with "vg deconstruct" VERSUS the approach based on the BubbleGun algorithm for genotyping pangenome graphs: potential inconsistency between these two approaches? #477

Open diaspj opened 1 year ago

diaspj commented 1 year ago

Dear odgi team,

I have used the BubbleGun algorithm recently published to determine the number of bubbles and superbubbles in a pangenome graph of a LAB bacteria.

The command I have used to calculate the bubbles and obtained an output file is the following:

BubbleGun -g Lactococcus_lactis_PANGENOME_Anchorwave.gfa bchains --bubble_json output.json

After the conclusion of BubbleGun computation, the output that I have obtained at the terminal was the following:

The number of Simple Bubbles is 3927 The number of Superbubbles is 72 The number of insertions is 534 Sequence coverage of the bubble chains is 3.1945646478383996% Node coverage of the bubble chains is 2.037636626829558% The longest chain seq-wise has 6220 bp The longest chain bubble_wise has 937 bubbles

Hence, the total number of bubbles detected by BubbleGun is:

3927 + 72 + 534 = 4533

FIRST APPROACH USED TO GENOTYPE THIS PANGENOME GRAPH

The first approach used to genotype this pangenome graph was based on the odgi software suite, as suggested at https://github.com/pangenome/odgi/issues/381.

I have combined the commands “odgi depth” and “odgi extract” to remove the high complexity regions of the graph.

odgi depth -t 20 -i Lactococcus_lactis_PANGENOME_Anchorwave.og -w 1000:0:100:0 > low_to_medium_depth_MAX=100.bed

odgi extract -t 20 -d 0 -i Lactococcus_lactis_PANGENOME_Anchorwave.og -b low_to_medium_depth_MAX=100.bed -o Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.og

odgi view -t 20 -g -i Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.og > Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.gfa

Then, based on the command “vg index”, I have obtained the .xg index file and based on the command “vg gbwt”, I have constructed the .gbwt index file.

Finally, combined with the .xg and the .gbwt index files, I have used the command “vg deconstruct” to obtain the corresponding .vcf file.

The resulting .vcf output file obtained from the simplified pangenome graph using the .bed file obtained at a depth of 100 comprised 4531 rows, missing only two bubbles that were predicted by BubbleGun to be the total number of variations comprised in the pangenome graph (total = 4533).

The two bubbles predicted by BubbleGun correspond to the following two rows in the VCF output file:

Lactococcus_lactis_subsp._lactis_275_CP016702.1 49453 <254259<254257 TA AC 60 . AC=0;AF=0;AN=0;AT=>254257>254258>254259,>254257<637681>254259;NS=0;LV=0 GT

Lactococcus_lactis_subsp._lactis_275_CP016702.1 49465 <254268<254266 TTA AAC 60 . AC=0;AF=0;AN=0;AT=>254266>254267>254268,>254266<426258>254268;NS=0;LV=0 GT

SECOND APPROACH USED TO GENOTYPE THIS PANGENOME GRAPH

The second approach used to genotype this pangenome graph was based on the BubbleGun, although this algorithm does not makes available a .vcf file equivalent to the information comprised in the .json output file.

How I have obtained the above mentioned .vcf file?

Because BubbleGun software suite generates a gfa output file (although lacking the majority of the information comprised in the original gfa file regarding the nodes and paths), under the assumption this file held the same bubble information that the json file held, I have “copy paste” all “nodes” and “paths” comprised in the original input gfa file to the output gfa file, keeping only the “links” determined by BubbleGun software suite.

Then, I just follow the commands described in the first approach. Based on the command “vg index”, I have obtained the .xg index file and based on the command “vg gbwt”, I have constructed the .gbwt index file. Finally, combined with the .xg and the .gbwt index files, I have used the command “vg deconstruct” to obtain the corresponding .vcf file.

Although this approach is not very elegant and requires “copy paste” of information into the BubbleGun gfa output file, if the “augmented” gfa file comprised equivalent information to the one comprised in the .json output file, the resulting .vcf file should comprise 4533 rows, corresponding to the bubbles identified by the BubbleGun software suite.

However, the vg command originated a VCF file comprising only 4218 rows, and 315 bubbles (4533 – 4218) are missing from the resulting vcf file.

OVERALL CONCLUSION: Taken into consideration the results obtained using the first approach (based on odgi) and the second approach (based on BubbleGun) and combining them into a single vcf file, I was able to recover a .vcf file comprising all variations predicted by BubbleGun software suite.

QUESTIONS: Excluding the obvious answer of using the command “vg deconstruct” directly over the original pangenome graph (I have tried this approach, and after more than 3 weeks, the command was still running), I want to pose the following four questions to this forum: 1) Considering that I have imposed a depth of 100 while using the command “odgi extract”, is it expectable to miss bubbles in the pangenome graph corresponding to variations in the vcf file obtained using the command “vg deconstruct”? 2) The identity column in the vcf file regarding the two rows that were found missing after the use of the command “vg deconstruct” were “<254259<254257” and “<254268<254266”, showing a lesser sign (“<”) linking the nodes instead of the bigger sign “>” observed in the identity column of all remaining rows in the vcf file. Might this variation in the vcf output file indicating the existence of a problem in the approach combining “odgi depth” / “odgi extract” with "vg deconstruct? 3) Instead, is it more likely that the BubbleGun algorithm might suffer from some bug and the two missing rows are false positives? 4) Is there a more straightforward set of commands based on the functions made available by the vg and odgi toolkits that can be used to obtain a complete .vcf file in a single run and in a reasonable time frame?

Any help on this issue is welcome,

Best regards,

Paulo Dias

subwaystation commented 1 year ago

Hi @diaspj ,

was @fawaz-dabbaghieh able to help you out?