twolinin / longphase

GNU General Public License v3.0

phasing short tandem repeat VCFs #62

Open ilivyatan opened 3 months ago

ilivyatan commented 3 months ago

Hi, I want to co-phase SNVs and STRs. I passed the STR VCF as the --sv-file, and longphase 'phase' runs and phases both VCFs. However, the STR VCF has one line per repeat locus, and that line includes the repeat lengths of both alleles, so it isn't clear which repeat length the phasing refers to. Ideally, the output would make it possible to determine which haplotype the expanded repeat is on. I have an example of two such paired VCFs where the SNVs are clearly in the same phase as the expanded repeat, based on viewing the reads in IGV, but the phased output reports them in two different phases. I can send the files to a personal email address if you provide one.

ythuang0522 commented 3 months ago

Hi, you can send the VCFs to ythuang at ccu.edu.tw. We haven't implemented co-phasing of STRs (assuming they were called by Straglr) with SNPs, but we can take a look at your example first. Kindly let us know if any phased STR benchmark is publicly available.

ythuang0522 commented 3 months ago

Hi Ilana,

Thanks for sending the VCF example. The Straglr VCF is not the same as those of other SV callers (e.g., Sniffles or CuteSV). Longphase extracts the IDs of the reads carrying each SV from the VCF and uses them for phasing. The Straglr VCF is not formatted as longphase expects, so you shouldn't trust the phased Straglr VCF. Straglr instead stores the essential per-read information (i.e., the repeat copy number of each read) in the TSV it provides. As such, the best solution would be a module dedicated to phasing STRs that takes the Straglr TSV as input (e.g., a new option --STR Straglr.tsv). We don't support this yet, as I am not sure how many people need it. If you are just testing Straglr with only one sample, you can run SNP-only phasing and obtain each read's haplotype assignment (PS tag) from the haplotagged BAM (or via --log). This information can then be joined with the Straglr TSV on read ID to learn the haplotype background of each STR repeat copy. I hope this helps.

Yao-Ting
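
For readers who want to try the join suggested above, here is a minimal Python sketch. It is not part of longphase or of the STR caller. It assumes the per-read haplotype assignments have already been exported to a tab-separated table with the hypothetical columns `read_id`, `phase_set` and `haplotype`, and that the STR caller's per-read TSV has columns named `locus`, `read_id` and `copy_number`. All of these column names are assumptions; rename them to match the real files.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def load_read_haplotypes(handle):
    """Map read_id -> (phase_set, haplotype) from a tab-separated table.

    The column names are hypothetical; the table is assumed to have been
    exported beforehand from the haplotagged BAM or the haplotag log.
    """
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        table[row["read_id"]] = (row["phase_set"], row["haplotype"])
    return table


def copies_by_haplotype(str_handle, read_haps):
    """Median repeat copy number per (locus, phase_set, haplotype).

    Reads that have no haplotype assignment are skipped.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(str_handle, delimiter="\t"):
        assignment = read_haps.get(row["read_id"])
        if assignment is None:
            continue  # read was not haplotagged
        key = (row["locus"], *assignment)
        grouped[key].append(float(row["copy_number"]))
    return {key: median(values) for key, values in grouped.items()}


# Tiny made-up example; real input would be opened from files instead.
demo_haps = io.StringIO(
    "read_id\tphase_set\thaplotype\n"
    "r1\t1000\t1\n"
    "r2\t1000\t1\n"
    "r3\t1000\t2\n"
    "r4\t1000\t2\n"
)
demo_str = io.StringIO(
    "locus\tread_id\tcopy_number\n"
    "locusA\tr1\t45\n"
    "locusA\tr2\t47\n"
    "locusA\tr3\t17\n"
    "locusA\tr4\t19\n"
    "locusA\tr5\t18\n"  # r5 has no haplotype assignment and is ignored
)

summary = copies_by_haplotype(demo_str, load_read_haplotypes(demo_haps))
for (locus, phase_set, haplotype), copies in sorted(summary.items()):
    print(locus, phase_set, haplotype, copies)
```

Grouping by phase set as well as haplotype matters because haplotype labels are only comparable within one phase set. If the SNVs of interest and the repeat fall into different phase sets, this join cannot tell whether they share a haplotype.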