starskyzheng / panpop

Application of pan-genome for population
MIT License

Question about final VCF for SV. Types. #26

Closed leone93 closed 6 months ago

leone93 commented 7 months ago

Hello, thanks for the software. After running the full pipeline I arrived at the final results, for instance 15.thin3.sv.vcf.gz, which contains only SVs. My question is how to interpret the SV type. I understand that the file contains DEL, INS, DUP (interpreted by the pipeline as insertions), and INV (merged into the merged_pop results in one of the last steps). But how do I discriminate between the types? How can I tell which record is an INS, which is a DEL, and which is an INV? I was thinking of using the lengths of the two sequence fields, but I'm not sure.

Second question: how can the software extract indels and SNPs if the SV caller is set to identify only events larger than 50 bp? Thanks for your time.

starskyzheng commented 6 months ago

Q1: Structural variants (SVs) identified by PanPop fall into three types: insertions (INS), deletions (DEL), and divergent variants (DIV). The type follows from the lengths of the REF and ALT fields:

DEL: length(REF) > 1 and length(ALT) ≤ 1.
INS: length(REF) == 1 and length(ALT) > 1.
DIV: any variant that meets neither the INS nor the DEL criteria.

Inversions (INV) are hard to distinguish because of their complex nature and are frequently classified as DIV.
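For readers who want to apply the Q1 rules to their own file, here is a minimal Python sketch. It is not PanPop code: the function names `classify_allele` and `classify_vcf_line` are made up for illustration, and the sketch assumes plain sequence alleles in the REF and ALT columns (symbolic alleles such as `<DEL>` or `*` are not treated specially). Multi-allelic records are assumed to list their ALT alleles comma-separated, as in standard VCF.

```python
def classify_allele(ref: str, alt: str) -> str:
    """Label one REF/ALT pair using the length rules quoted in the answer."""
    if len(ref) > 1 and len(alt) <= 1:
        return "DEL"
    if len(ref) == 1 and len(alt) > 1:
        return "INS"
    # Everything else, which may include inversions, is divergent.
    return "DIV"


def classify_vcf_line(line: str) -> list[str]:
    """Return one label per ALT allele of a tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    ref = fields[3]              # column 4 of a VCF record is REF
    alts = fields[4].split(",")  # column 5 is ALT, comma-separated
    return [classify_allele(ref, alt) for alt in alts]


# Example record with two ALT alleles (made-up data).
example = "chr1\t100\t.\tACGTACGT\tA,ACGTTTTT\t.\tPASS\t."
print(classify_vcf_line(example))  # ['DEL', 'DIV']
```

To run this over a compressed file such as 15.thin3.sv.vcf.gz, one could open it with `gzip.open(path, "rt")` and skip lines starting with `#`. Note that under these rules a record with single-base REF and ALT would also be labeled DIV, so the sketch is only meaningful on a file that already contains SVs only.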

Q2: PanPop aims to reconcile the common elements of the SVs detected by the various SV callers. Because of inherent discrepancies among these callers, a consolidated SV may be split into several smaller records, including SNPs and indels.