twolinin / longphase

GNU General Public License v3.0

Co-phasing consistency #36

Open ilivyatan opened 7 months ago

ilivyatan commented 7 months ago

Hi, longphase has options for co-phasing SNVs with SVs, and SNVs with called modifications. The input for co-phasing is an unphased SNV.vcf; it didn't work with an already phased VCF when I tried. How can I ensure that two consecutive co-phasing runs, one with SVs and one with methylation variants, give consistent results and use the same phase across all three?

Also, is it possible to unphase a phased snv.vcf by just replacing '|' with '/' in the GT field, or will that interfere with subsequent co-phasing attempts with other VCF types?

twolinin commented 7 months ago

Hi @ilivyatan,

  1. Could you please provide the command line you used and elaborate on which file didn't work? As far as I know, phasing of the SNV.vcf works with either an unphased or a phased SNV.vcf as input.

  2. Are you asking how to ensure that the phased SNV.vcf results are consistent between co-phasing SNV.vcf with SV.vcf and co-phasing SNV.vcf with modification.vcf?

  3. You can convert a phased VCF to an unphased VCF by changing genotypes like 0|1 and 1|0 to 0/1. Then, remove the PS tag and its corresponding PS value.
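Step 3 above could be sketched in Python roughly as follows. This is a minimal illustration, not part of longphase: the function name `unphase_vcf_line` is hypothetical, and it assumes a tab-separated VCF with a FORMAT column. It leaves the `##FORMAT=<ID=PS...>` header line in place, which should be harmless.

```python
def unphase_vcf_line(line: str) -> str:
    """Return a VCF line with phased genotypes unphased and the PS field removed."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns, nothing to unphase
    keys = fields[8].split(":")
    keep = [i for i, k in enumerate(keys) if k != "PS"]
    fields[8] = ":".join(keys[i] for i in keep)
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        out = []
        for i in keep:
            if i >= len(values):
                continue  # trailing sample fields may be omitted in VCF
            v = values[i]
            if keys[i] == "GT" and "|" in v:
                alleles = v.split("|")
                # Sort alleles so that both 0|1 and 1|0 become 0/1;
                # missing alleles (".") are placed last.
                alleles.sort(key=lambda a: (a == ".", int(a) if a.isdigit() else 0))
                v = "/".join(alleles)
            out.append(v)
        fields[col] = ":".join(out)
    return "\t".join(fields) + "\n"
```

It can be applied line by line to a phased VCF, writing the results to a new file.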

thanks

ilivyatan commented 7 months ago

  1. I will try to reproduce.
  2. Yes, that is the question. A further question is how to also co-phase a strglr output VCF. Can I supply it as the --sv-file parameter?
  3. This worked, thanks.

ythuang0522 commented 7 months ago

For your 2nd question, we recommend co-phasing SNPs, indels, SVs, and modifications at once. If you insist on phasing SVs and modifications separately, I would expect most regions to be consistent, as the majority of the phasing information comes from SNPs. However, some discrepancies may appear in regions where SVs and modifications disagree on the haplotype assignment because of wrong or low-quality SV/modification calls. To maximize the phasing range and obtain phasing agreement across all types of genetic and epigenetic variants, we recommend phasing them at once.
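If the two co-phasing runs are nevertheless done separately, their agreement could be checked after the fact. The sketch below is not a longphase feature; the function names are hypothetical, and it assumes single-sample VCFs whose phased heterozygous SNVs carry `GT` and `PS` fields. Because haplotype labels are arbitrary within each phase set, a whole block being flipped between runs is not a disagreement; only a mix of same and flipped orientations within one pair of phase sets is.

```python
from collections import defaultdict


def parse_phased_hets(vcf_lines):
    """Map (chrom, pos, ref, alt) -> (GT, PS) for phased het calls of the first sample."""
    calls = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 10:
            continue
        sample = dict(zip(f[8].split(":"), f[9].split(":")))
        gt = sample.get("GT", "")
        if gt in ("0|1", "1|0"):
            calls[(f[0], int(f[1]), f[3], f[4])] = (gt, sample.get("PS", "."))
    return calls


def compare_phasing(calls_a, calls_b):
    """Count same/flipped orientations for every pair of phase sets shared by two runs."""
    tally = defaultdict(lambda: {"same": 0, "flipped": 0})
    for key in calls_a.keys() & calls_b.keys():
        gt_a, ps_a = calls_a[key]
        gt_b, ps_b = calls_b[key]
        block = (key[0], ps_a, ps_b)
        tally[block]["same" if gt_a == gt_b else "flipped"] += 1
    return dict(tally)


def discordant_blocks(tally):
    """Return phase-set pairs in which both orientations occur, i.e. the runs disagree."""
    return {b: c for b, c in tally.items() if c["same"] and c["flipped"]}
```

Typical use would be to read the SNV output of each run with `open(...)`, pass the line iterators to `parse_phased_hets`, and inspect `discordant_blocks(compare_phasing(a, b))`; an empty result means the shared SNVs are phased consistently up to whole-block flips.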

As to your next question, I assume you are referring to straglr for short tandem repeat (STR) predictions. We have never used straglr and don't know whether it can output a VCF with the IDs of the reads carrying each SV/STR, as Sniffles2 and cuteSV do, e.g., Sniffles2: --output-rnames; cuteSV: --report_readid --genotype

If its output VCF is formatted like those of Sniffles2/cuteSV, co-phasing it with the --sv-file argument should be no problem. According to its GitHub repo, it can output read names and status, but I cannot be sure how these are encoded in the VCF. It would be great if you could compare it with a Sniffles2 VCF or provide an example of its VCF.
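One rough way to make that comparison is to list which INFO keys appear in a Sniffles2 VCF but not in the straglr VCF. The sketch below uses hypothetical function names and only inspects INFO keys; it does not check whether longphase would accept the file. To my understanding Sniffles2 writes the read names to an INFO field called `RNAMES` when run with --output-rnames, but that should be verified against an actual output file.

```python
def info_keys(vcf_lines):
    """Collect the set of INFO keys used by the records of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 8 or f[7] == ".":
            continue  # malformed record or empty INFO column
        for entry in f[7].split(";"):
            if entry:
                # "KEY=value" and flag-style "KEY" entries both yield KEY
                keys.add(entry.split("=", 1)[0])
    return keys


def missing_info_keys(reference_lines, candidate_lines):
    """Return INFO keys present in the reference VCF but absent from the candidate VCF."""
    return info_keys(reference_lines) - info_keys(candidate_lines)
```

If the read-name key shows up in the returned set, the candidate VCF probably does not carry the per-variant read IDs in the same way.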