twolinin / longphase

GNU General Public License v3.0
Population-scale co-phasing of SVs & SNPs #11

Open tuannguyen8390 opened 2 years ago

tuannguyen8390 commented 2 years ago

Hi team,

As per the title, I was wondering what is a good way to do population co-phasing.

Say, I have 3 individuals, then I do it as per suggestion in your README.

Predict SVs for individuals A, B, C with Sniffles
Predict SNPs for individuals A, B, C with Clair3

Cat A, B, C, force-calling the genotypes. Then I merge the SNPs & SVs of A, B, C into a multi-sample VCF. Then run LongPhase on it? Is this a good way to go, and is LongPhase capable of doing that?

Many thanks,

Tuan Nguyen

ythuang0522 commented 2 years ago

Hi Tuan, LongPhase cannot phase multiple samples simultaneously (i.e., from a multi-sample VCF) because it is a read-based phasing algorithm. There are two possibilities. The first is phasing each individual independently, each with its own VCF and BAM file, e.g., longphase -s A_SNP.vcf --sv-file A_SV.vcf -b A_alignment.bam ..., longphase -s B_SNP.vcf --sv-file B_SV.vcf -b B_alignment.bam ..., and likewise for C. Then merge the three phased VCFs into a multi-sample VCF as you want.

However, I am not sure how existing tools (e.g., vcftools) merge phased VCFs. If vcftools can't properly retain the phasing information, the alternative is to generate a multi-sample VCF as you did, and then regenerate each individual's VCF (now expanded with all population SNPs), e.g., cat multisample_VCF | cut -f 1-9,10 > A_expanded.vcf, where column 10 should be replaced with the sample column of interest.

Then you can run LongPhase for each individual with its own expanded VCF and BAM files: longphase -s A_expanded_SNP.vcf --sv-file A_expanded_SV.vcf -b A_alignment.bam ...
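The column-extraction step above can be sketched as a small shell helper. This is only an illustration: extract_sample is a hypothetical function (not part of LongPhase or vcftools), and it assumes an uncompressed, tab-delimited multi-sample VCF whose "##" meta lines contain no tabs. Instead of hard-coding column 10, it looks the sample name up in the #CHROM header line.

```shell
set -eu

# extract_sample: hypothetical helper for illustration only.
# Usage: extract_sample <multisample.vcf> <sample_name> > <sample>_expanded.vcf
extract_sample() {
    es_vcf="$1"
    es_sample="$2"
    # Find the 1-based column whose #CHROM header field equals the sample name.
    # Sample columns start at column 10 in a VCF.
    es_col=$(awk -F'\t' -v s="$es_sample" '
        /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == s) { print i; exit } }
    ' "$es_vcf")
    if [ -z "$es_col" ]; then
        echo "sample '$es_sample' not found in $es_vcf" >&2
        return 1
    fi
    # Columns 1-9 are the fixed VCF fields. Lines without a tab (the "##"
    # meta lines, under the assumption above) are passed through unchanged by cut.
    cut -f "1-9,$es_col" "$es_vcf"
}
```

For example, extract_sample multisample.vcf B > B_expanded.vcf would keep every site of the population but only B's genotype column. Sites where B is homozygous reference or missing remain in the file.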

Yao-Ting

GuillaumeHolley commented 2 years ago

In general, I think it would make a lot of sense to first call SVs with the population-calling mode of Sniffles 2 and to use GVCF merging for the SNPs (so call each sample with Clair3 but with GVCF output, and then merge the GVCFs into a multi-sample VCF with GLnexus). If your 3 individuals are a trio and your reads are PacBio HiFi, you can also consider DeepTrio. Then you can separate the multi-sample VCF into 3 individual VCFs with bcftools, phase them independently with LongPhase, and re-merge them with bcftools merge (this should keep the phasing).
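A rough sketch of the split, phase, and re-merge steps for the SNP VCF follows. Everything not stated in the thread is an assumption: the file names (cohort.snp.vcf.gz, A.bam, reference.fasta), the thread count, and the LongPhase invocation (the phase subcommand with -r, -t, -o and --ont as in recent LongPhase releases, writing <prefix>.vcf). By default the script only prints the commands (DRY_RUN=1), so it can be inspected before anything is executed. An SV VCF could be passed to LongPhase the same way with --sv-file.

```shell
set -eu

# Assumed inputs; override through the environment as needed.
MULTI_VCF="${MULTI_VCF:-cohort.snp.vcf.gz}"   # multi-sample SNP VCF (assumed name)
REF="${REF:-reference.fasta}"                 # reference used for alignment (assumed name)
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands only, 0 = also run them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

phase_cohort() {
    phased=""
    for s in "$@"; do
        # 1. Split: keep all sites but only this sample's genotype column.
        run bcftools view -s "$s" -Ov -o "$s.snp.vcf" "$MULTI_VCF"
        # 2. Phase the sample against its own reads (assumes <sample>.bam exists).
        run longphase phase -s "$s.snp.vcf" -b "$s.bam" -r "$REF" -t 8 -o "$s.phased" --ont
        # 3. Compress and index, since bcftools merge expects indexed bgzipped input.
        run bgzip -f "$s.phased.vcf"
        run tabix -p vcf "$s.phased.vcf.gz"
        phased="$phased $s.phased.vcf.gz"
    done
    # 4. Re-merge the phased single-sample VCFs into one multi-sample VCF.
    #    $phased is intentionally unquoted so it splits into one argument per file.
    run bcftools merge -Oz -o cohort.phased.vcf.gz $phased
}
```

Calling phase_cohort A B C prints the thirteen commands for a three-sample cohort. It is worth checking afterwards that the merged genotypes still use the "|" separator and carry the PS tag.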

tuannguyen8390 commented 2 years ago

Interesting, we are currently trying out the population-calling mode with *.snf files in Sniffles2. But it seems that won't retain the read name info (?) - so I'm unsure if that would work with LongPhase.

Our samples are indeed a trio, but sequenced with ONT.

ythuang0522 commented 2 years ago

From the sniffles2 README, read names can be stored in *.snf, unless they are thrown away after merging.

To output read names in SNF and VCF files, the --output-rnames option is required.
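As a sketch of how this might look for population calling: the --input, --snf, --vcf and --output-rnames options are taken from the Sniffles2 README, while the file names are assumptions. Whether the read names actually survive the combined calling step is what remains to be tested here, so the output VCF should be checked for read names before it is passed to LongPhase. The script only prints the commands unless DRY_RUN=0.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = also run them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

sniffles_population() {
    snfs=""
    for s in "$@"; do
        # Per-sample calling: write <sample>.snf and ask for read names to be kept.
        run sniffles --input "$s.bam" --snf "$s.snf" --output-rnames
        snfs="$snfs $s.snf"
    done
    # Combined calling over all .snf files; the option is repeated here so that
    # read names are also requested in the multi-sample VCF.
    # $snfs is intentionally unquoted so it splits into one argument per file.
    run sniffles --input $snfs --vcf cohort.sv.vcf --output-rnames
}
```

For three samples, sniffles_population A B C prints three per-sample commands followed by the combined call.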

tuannguyen8390 commented 2 years ago

@ythuang0522 Thanks so much for letting me know. I only used --output-rnames for single-individual calling and didn't know it would also be needed at the merging step! I will test this out and report back with the result.

@GuillaumeHolley thanks for the suggestion, I will try this out as well. I haven't used GLnexus or GVCF before, so sorry if this is a stupid question: could you please explain the difference between having Clair3 output GVCF vs. VCF here? I had been thinking about merging the VCFs with bcftools merge at that particular step.

GuillaumeHolley commented 2 years ago

@tuannguyen8390 A GVCF file follows the VCF format specification but also contains records (lines) that do not correspond to a variant call but instead to a reference call or a no-call. The idea is to keep some information, such as read coverage, for sites that are neither heterozygous nor homozygous alt. Merging GVCFs rather than VCFs has (at least) 2 key advantages: