Closed OZTaekOppa closed 2 months ago
Hi @OZTaekOppa, I would suggest using the chromosome-level scaffolds for phasing if those are your finalized/stable assembly. As such the phased SNPs/SVs can be associated with the annotated genes in the same assembly/coordinate system.
Thank you for your prompt reply. I will let you know after trying your suggestion. Cheers!
I have another question. For SNPs/SVs, is there any option to include more than one vcf file? Let's say, I have three SNP VCFs from Illumina PE, PacBio and Oxford Nanopore. Can I include all three SNP VCF files? And, I have two SV VCFs from PacBio and Oxford Nanopore. Can I include these two SV VCF files? If not, which SNP and SV VCFs would be better for LongPhase? Any suggestion would be really appreciated! Cheers!
No. LongPhase takes only one VCF for SNPs and one for SV. It depends on which platform you trust most for SNP/SV calling and whether you prefer sensitivity or specificity. Below are just my personal comments.
For SNPs, you may check the PrecisionFDA challenges. if your PacBio refers to Hifi (not CLR), I would go with Hifi-called SNPs. If Dragen was used for Illumina-SNP calling, the Illumina VCF should also be good. The PrecisionFDA also suggested the combination of multiple platforms which you may consider.
For SVs, though there were few benchmarks, to my best knowledge, none were as reliable as the PrecisionFDA challenges. The only thing I am sure of is that the accuracy of SVs called by either platform is lower than that of SNPs. As such the SVs concordantly called by these two platforms would be my first choice (for specificity). If you only want either of them, ONT should call more SVs than PacBio yet with lower specificity. In the phasing stage, we would recommend using ONT long reads even if SNPs/SVs were called by Illumina or PacBio or whatever combination, as read length matters more than accuracy in phasing.
I hope this helps.
Thanks for your valuable comments. Now, I have made all the necessary input files for the program. However, I am not sure which BAM file should be in (-b alignment.bam)? As you imagine, I have two bam files; one from Illumina (SNP call) and the other from PacBio/ONT (SV call). Many thanks all the time.
You have to use the long read bam. ONT would be better.
Thanks for your reply all the time. I have tried as the instruction said. The first SNP and SV co-phasing looked ok because there was no error. The second haplotag command caused an issue. Please see the below error message.
FYI, the SNP (Illumina PE reads) was called from BWA MEM and GATK v4 (without BaseRecalibration).
Did I miss something?
Any feedback would be really appreciated!
Hi @OZTaekOppa,
Could you please check the VCF file for the following information?
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
Can you provide the result of the following command
grep PS LPhs_BrahFP.vcf | head -n 10
grep PS LPhs_BrahFP_SV.vcf | head -n 10
Thanks
Thanks for your prompt reply!
Re format,
Re grep PS LPhs_BrahFP.vcf | head -n 10, this file was generated from Illumina PE reads with GATK.
BrahChr1_FP1 1 BrahChr1_FP1_1_A A
Re grep PS LPhs_BrahFP_SV.vcf | head -n 10, this file was generated from long-reads with cuteSV.
BrahChr10_FP1 2 cuteSV.BND.0 N [BrahChr16_FP1:0[N 105.0 PASS PRECISE;SVTYPE=BND;RE=11;RNAMES=BrahFetONT_13742112,BrahFetONT_10953870,BrahFetONT_3103665,BrahFetONT_13596645,BrahFetONT_8832824,BrahFetONT_8218586,BrahFetONT_8571551,BrahFetONT_9149973,BrahFetONT_6733348,BrahFetONT_13695233,BrahFetONT_12979073;AF=1.0 GT:DR:DV:PL:GQ:PS 1/1:0:11:105,28,0:28:. BrahChr10_FP1 2313 cuteSV.INS.0 G GGGATCAGCCACATGATCACTGTCGTTCAGTCAGGTGCTCAGTTCT 8.7 PASS PRECISE;SVTYPE=INS;SVLEN=45;END=2313;CIPOS=-82,82;CILEN=-5,5;RE=7;RNAMES=BrahFetONT_7758083,BrahFetONT_10704010,BrahFetONT_7167530,BrahFetONT_10344073,BrahFetONT_13963964,BrahFetONT_10850772,BrahFetONT_3098190;AF=0.3043 GT:DR:DV:PL:GQ:PS 0/1:16:7:9,1,95:8:. BrahChr10_FP1 3588 cuteSV.INS.1 C CCGCCCATCTGCATGACAATTGCCAGGTGAGCTGCGTGATTCTGTCGGCTCATCTGCTTGACAATCCCAGCTGAGCTGC 187.3 PASS PRECISE;SVTYPE=INS;SVLEN=78;END=3588;CIPOS=-30,30;CILEN=-9,9;RE=31;RNAMES=BrahFetONT_5393058,BrahFetONT_3796473,BrahFetONT_4060562,BrahFetONT_4983571,BrahFetONT_9417613,BrahFetONT_3825902,BrahFetONT_10704010,BrahFetONT_12109135,BrahFetONT_10344073,BrahFetONT_2702150,BrahFetONT_3971571,BrahFetONT_13695233,BrahFetONT_9298532,BrahFetONT_6853143,BrahFetONT_12188515,BrahFetONT_3682276,BrahFetONT_6993887,BrahFetONT_13704695,BrahFetONT_8149995,BrahFetONT_1066796,BrahFetONT_7758083,BrahFetONT_8976138,BrahFetONT_12408260,BrahFetONT_13963964,BrahFetONT_5635706,BrahFetONT_7167530,BrahFetONT_11104630,BrahFetONT_14062239,BrahFetONT_6733348,BrahFetONT_13127513,BrahFetONT_6158707;AF=0.7209 GT:DR:DV:PL:GQ:PS 0/1:12:31:187,1,6:5:. BrahChr10_FP1 4460 cuteSV.DEL.0 CATGATCAGCCACGTGATCACTGACTGCACAGTCACGTGATCAGTGCCTCTCAGTCACGTGATCACCGTATGATAAACACGTGATCAGTGAACG C 0.1 q5 IMPRECISE;SVTYPE=DEL;SVLEN=-93;END=4553;CIPOS=-12,12;CILEN=-6,6;RE=9;RNAMES=BrahFetONT_6807993,BrahFetONT_10344073,BrahFetONT_13971798,BrahFetONT_1608376,BrahFetONT_2702150,BrahFetONT_3796473,BrahFetONT_3825902,BrahFetONT_12209070,BrahFetONT_13704695;AF=0.2195;STRAND=+- GT:DR:DV:PL:GQ:PS 0/0:32:9:0,19,220:18:. BrahChr10_FP1 4574 cuteSV.INS.2 A AATGTTCAGTCAGGTGGTTAGTTCCTGATCAGCCACGGAATCAGT 0.1 q5 IMPRECISE;SVTYPE=INS;SVLEN=44;END=4574;CIPOS=-24,24;CILEN=-4,4;RE=6;RNAMES=BrahFetONT_5183780,BrahFetONT_2283900,BrahFetONT_1514757,BrahFetONT_7335015,BrahFetONT_4766498,BrahFetONT_3135862;AF=0.2 GT:DR:DV:PL:GQ:PS 0/0:24:6:0,19,172:19:. BrahChr10_FP1 4643 cuteSV.INS.3 C CGTCCCATGCTCAGAGCCTGTCAGTCACCTGATCTGCCTGATAAACAGTGATCAGTGCGAACAGCACGTCATCAGTGCATGCGCAGTCATGATGAGTGCATGACAGGTGATCAGTGTGTAC 0.0 q5 IMPRECISE;SVTYPE=INS;SVLEN=120;END=4643;CIPOS=-24,24;CILEN=-12,12;RE=5;RNAMES=BrahFetONT_9532622,BrahFetONT_11296644,BrahFetONT_7463218,BrahFetONT_13695233,BrahFetONT_6733348;AF=0.1724 GT:DR:DV:PL:GQ:PS 0/0:24:5:0,26,181:26:. BrahChr10_FP1 4800 cuteSV.DEL.1 TTGATCAGCCACGTGATCAGTGCCTGCAGAGTCACGTGATCAGGGCCTG T 0.0 q5 IMPRECISE;SVTYPE=DEL;SVLEN=-48;END=4848;CIPOS=-42,42;CILEN=-3,3;RE=14;RNAMES=BrahFetONT_10850772,BrahFetONT_14062239,BrahFetONT_3135862,BrahFetONT_1066796,BrahFetONT_1002696,BrahFetONT_4766498,BrahFetONT_11423571,BrahFetONT_5393058,BrahFetONT_7028860,BrahFetONT_4931796,BrahFetONT_3682276,BrahFetONT_3971571,BrahFetONT_13421098,BrahFetONT_8976138;AF=0.2333;STRAND=+- GT:DR:DV:PL:GQ:PS 0/0:46:14:0,20,305:19:. BrahChr10_FP1 4891 cuteSV.INS.4 C CTCATTGCCAGCTGAGCTTGGGATTCTGCGACTCATTTAGCCAAG 118.6 PASS PRECISE;SVTYPE=INS;SVLEN=44;END=4891;CIPOS=-15,15;CILEN=-1,1;RE=25;RNAMES=BrahFetONT_2567755,BrahFetONT_8571551,BrahFetONT_5876644,BrahFetONT_12209070,BrahFetONT_9149973,BrahFetONT_6158707,BrahFetONT_10704010,BrahFetONT_6807993,BrahFetONT_4983571,BrahFetONT_9324359,BrahFetONT_7028860,BrahFetONT_13971798,BrahFetONT_3825902,BrahFetONT_10356213,BrahFetONT_523643,BrahFetONT_9334000,BrahFetONT_3796473,BrahFetONT_1608376,BrahFetONT_14062239,BrahFetONT_3971571,BrahFetONT_11672843,BrahFetONT_4060562,BrahFetONT_2702150,BrahFetONT_685185,BrahFetONT_10344073;AF=0.5319 GT:DR:DV:PL:GQ:PS 0/1:22:25:119,0,90:89:. BrahChr10_FP1 5882 cuteSV.INS.5 C CCCACATGATCACTGCATGTTCAGTTCCTGATCATCAGCCACCTGATCACTGCATGTTCAGGTGGTTAG 458.8 PASS PRECISE;SVTYPE=INS;SVLEN=68;END=5882;CIPOS=-34,34;CILEN=-5,5;RE=75;RNAMES=BrahFetONT_7758083,BrahFetONT_6292265,BrahFetONT_1514757,BrahFetONT_9324359,BrahFetONT_10850772,BrahFetONT_5876644,BrahFetONT_523643,BrahFetONT_13963964,BrahFetONT_10704010,BrahFetONT_2300906,BrahFetONT_12986704,BrahFetONT_2747967,BrahFetONT_3213196,BrahFetONT_4391037,BrahFetONT_3054294,BrahFetONT_13848751,BrahFetONT_1043954,BrahFetONT_8774927,BrahFetONT_6091667,BrahFetONT_7167530,BrahFetONT_4434546,BrahFetONT_7410833,BrahFetONT_4983571,BrahFetONT_7317771,BrahFetONT_7028860,BrahFetONT_1010531,BrahFetONT_11100551,BrahFetONT_14062239,BrahFetONT_6157363,BrahFetONT_13695233,BrahFetONT_2702150,BrahFetONT_5982345,BrahFetONT_2942672,BrahFetONT_3971571,BrahFetONT_2375602,BrahFetONT_10344073,BrahFetONT_10977499,BrahFetONT_12209070,BrahFetONT_1608376,BrahFetONT_11672843,BrahFetONT_6158707,BrahFetONT_11251423,BrahFetONT_12762111,BrahFetONT_9334000,BrahFetONT_11992769,BrahFetONT_7580884,BrahFetONT_8395881,BrahFetONT_5393058,BrahFetONT_13379399,BrahFetONT_6312644,BrahFetONT_6807993,BrahFetONT_11094384,BrahFetONT_3796473,BrahFetONT_618191,BrahFetONT_13125102,BrahFetONT_13127513,BrahFetONT_6993887,BrahFetONT_2948035,BrahFetONT_5813126,BrahFetONT_13971798,BrahFetONT_3825902,BrahFetONT_1498197,BrahFetONT_10356213,BrahFetONT_4481979,BrahFetONT_8987000,BrahFetONT_1357018,BrahFetONT_7919309,BrahFetONT_7335015,BrahFetONT_11073076,BrahFetONT_5769150,BrahFetONT_1932079,BrahFetONT_13829341,BrahFetONT_922519,BrahFetONT_4766498,BrahFetONT_6798603;AF=0.7353 GT:DR:DV:PL:GQ:PS 1/1:27:75:459,8,1:7:.
After reading again the manual, I have followed the exact instruction for long-read SV but not for short-read SNP. However, there was no explanation for calling SNPs using Illumina PE reads. The manual only explains to generate an SNP file using PEPPER-Margin-DeepVariant and Clair3 which are for long-reads. It is required to call the SNP from the long-reads?
Any suggestion would be really helpful!
Cheers
Hi @OZTaekOppa
Maybe the part of extracting PS information was not handled properly. Could you please remove "##GATKCommandLine=..." and execute the longphase again?
Thanks
Hi @twolinin,
I have tried again.
Executed script:
This script generated two files. -rw-rw---- 1 XXXX Q4840RW 150566273 Jun 4 11:52 LPhs_BrahMP_SV.vcf -rw-rw---- 1 XXXX Q4840RW 13801240011 Jun 4 11:52 LPhs_BrahMP.vcf
/Extra_Programs/longphase/longphase haplotag \ -s /ASM_Mom/Phased/LongPhase/LPhs_BrahMP.vcf \ --sv-file /ASM_Mom/Phased/LongPhase/LPhs_BrahMP_SV.vcf \ -b /ASM_Mom/cuteSV/PBONT_Mrg/BrahMPPOTV9.sorted.bam \ -t 12 \ -o LPhs_BrahMP_Result
This script generated a single file. -rw-rw---- 1 XXXX Q4840RW 95535357165 Jun 4 12:46 LPhs_BrahMP_Result.bam
After these two steps, the log file shows below.
total process time: 2776s total alignment: 42631097 total supplementary: 2515368 total secondary: 0 total unmapped: 0 total tag alignment: 0 total untagged: 24564268 ########################### Job Execution History ############################# JobId:243875.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahMP_LPhs SessionId:8122 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=336:00:00 ResourcesUsed:cpupercent=850,cput=05:47:39,mem=382964kb,ncpus=12,vmem=1580208kb,walltime=00:46:32 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################
Is this the final output? No phased fasta file (e.g. primary and/or alternative)?? Do I need to run different programs to generate the final phased fasta files?
Thanks.
Hi @OZTaekOppa
I noticed that the ALT field of LPhs_BrahFP.vcf is
BrahChr1_FP1 1 BrahChr1_FP1_1_A A <NON_REF> . . END=187 GT:DP:GQ:MIN_DP:PL:PS 0/0:0:0:0:0,0,0:.
BrahChr1_FP1 188 BrahChr1_FP1_188_G G <NON_REF> . . END=189 GT:DP:GQ:MIN_DP:PL:PS 0/0:2:6:2:0,6,84:.
BrahChr1_FP1 190 BrahChr1_FP1_190_A A <NON_REF> . . END=197 GT:DP:GQ:MIN_DP:PL:PS 0/0:3:9:3:0,9,125:.
I want to check if phased SNP exists in LPhs_BrahMP.vcf after phasing.
grep "|" LPhs_BrahMP.vcf | wc -l
If the number is 0 I want to check the information in the ALT field of heterozygous SNPs.
grep "0/1" LPhs_BrahMP.vcf | head -n 5
Thanks.
Hi @twolinin,
I have checked again.
XXXX@flashlite1:.../ASM_Mom/Phased/LongPhase> grep "|" LPhs_BrahMP.vcf | wc -l 3612187
XXXX@flashlite1:.../ASM_Mom/Phased/LongPhase> grep "0/1" LPhs_BrahMP.vcf | head -n 5
BrahChr1_MP1 93 BrahChr1_MP1_93_C C T,
FYI, I have used GATK 4.2 for calling SNPs (Illumina PE reads).
Thanks.
Hi @OZTaekOppa
I guess the problem is the ALT field. You can remove the " ,longphase phase
again.
sed 's/,<NON_REF>//g' your_unPhasedSNP.vcf
If the problem is still not solved, you can provide some BAM and VCF to help me find the problem faster.
Thanks.
Hi @twolinin,
Thanks. That's what I thought. Some reason why the GATK generated the
Thanks.
Hi @twolinin,
I have tested again with BCFtools (SNP.vcf). However, I do not see any final phased fasta file at all.
/Extra_Programs/longphase/longphase phase \ -s /ASM_Fet/SNPCalls/BCFtools/marked_Brah_FetTPEDupRG_SNP2_var.flt.vcf \ --> SNP.vcf called from BCFtools --sv-file /ASM_Fet/cuteSV/BrahFPONT_cuteSVLPV9.vcf \ --> SV.vcf called from cuteSV -b /ASM_Fet/cuteSV/BrahFPONTV9.sorted.bam \ --> sorted.bam called from cuteSV -r /ASM_Fet/HapG/hapog_results/BrahN2_FetPwM.fasta \ -t 12 \ -o LPhs_BrahFP \ --ont \
This generated the below outcomes without any errors.
-rw-rw---- 1 XXXX Q4840RW 63819454 Jun 8 13:00 LPhs_BrahFP_SV.vcf -rw-rw---- 1 XXXX Q4840RW 1933444110 Jun 8 12:59 LPhs_BrahFP.vcf
############################################################################### parsing VCF ... 22s parsing SV VCF ... 0s reading reference ... 13s parsing contig/chromosome: BrahChr1_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 59s parsing contig/chromosome: BrahChr2_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 48s parsing contig/chromosome: BrahChr3_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 45s parsing contig/chromosome: BrahChr4_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 51s parsing contig/chromosome: BrahChr5_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 41s parsing contig/chromosome: BrahChr6_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 42s parsing contig/chromosome: BrahChr7_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 39s parsing contig/chromosome: BrahChr8_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 45s parsing contig/chromosome: BrahChr9_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 62s parsing contig/chromosome: BrahChr10_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 37s parsing contig/chromosome: BrahChr11_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 34s parsing contig/chromosome: BrahChr12_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 35s parsing contig/chromosome: BrahChr13_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 29s parsing contig/chromosome: BrahChr14_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 30s parsing contig/chromosome: BrahChr15_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 36s parsing contig/chromosome: BrahChr16_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 29s parsing contig/chromosome: BrahChr17_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 27s parsing contig/chromosome: BrahChr18_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 32s parsing contig/chromosome: BrahChr19_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 24s parsing contig/chromosome: BrahChr20_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 27s parsing contig/chromosome: BrahChr21_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 28s parsing contig/chromosome: BrahChr22_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 21s parsing contig/chromosome: BrahChr23_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 21s parsing contig/chromosome: BrahChr24_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 24s parsing contig/chromosome: BrahChr25_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 17s parsing contig/chromosome: BrahChr26_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 16s parsing contig/chromosome: BrahChr27_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 15s parsing contig/chromosome: BrahChr28_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 22s parsing contig/chromosome: BrahChr29_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 23s parsing contig/chromosome: BrahChrX_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 23s parsing contig/chromosome: MtGnmP ... skip writeResult ... 66s write SV Result ... 4s ########################### Job Execution History ############################# JobId:244611.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahFP_LPhs SessionId:23273 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=22:00:00 ResourcesUsed:cpupercent=175,cput=00:30:33,mem=41511928kb,ncpus=12,vmem=42719656kb,walltime=00:18:28 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################
/Extra_Programs/longphase/longphase haplotag \ -s /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP.vcf \ --> phased SNP.vcf called from longphase --sv-file /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP_SV.vcf \ --> phased SV.vcf called from longphase -b /ASM_Fet/cuteSV/BrahFPONTV9.sorted.bam \ --> sorted.bam called from cuteSV -t 12 \ -o LPhs_BrahFP_Result \
This generated the below outcomes without any errors.
-rw-rw---- 1 XXXX Q4840RW 31954304760 Jun 8 13:21 LPhs_BrahFP_Result.bam
total process time: 887s total alignment: 21029397 total supplementary: 1319137 total secondary: 0 total unmapped: 0 total tag alignment: 8762479 total untagged: 3542710 ########################### Job Execution History ############################# JobId:244613.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahFP_LPhs SessionId:8794 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=158:00:00 ResourcesUsed:cpupercent=752,cput=01:53:21,mem=3252340kb,ncpus=12,vmem=4452288kb,walltime=00:15:08 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################
According to the second part command (longphase haplotag: -o, --out-prefix=NAME prefix of phasing result. default:result), I expected to see the final phased fasta files (in the result folder) but "Phs_BrahFP_Result.bam."
I have even checked both phased SNP and SV.vcf files.
LPhs_BrahFP.vcf
BrahChr1_FP1 1049 . ATCAGCCAAATGATCAGTGCATGCAGTC A 30.3007 . INDEL;IDV=15;IMF=0.205479;DP=73;VDB=0.767633;SGB=-0.670168;MQSB=0.0901416;MQ0F=0;AC=1;AN=2;DP4=45,18,9,1;MQ=35 GT:PL:PS 0/1:64,0,185:. BrahChr1_FP1 3778 . G A 143.932 . DP=95;VDB=0.136722;SGB=-0.692067;RPB=0.000122914;MQB=0.947213;MQSB=0.959302;BQB=0.94205;MQ0F=0;AC=1;AN=2;DP4=31,31,13,7;MQ=43 GT:PL:PS 1|0:178,0,255:3778 BrahChr1_FP1 4759 . C T 39.0226 . DP=79;VDB=0.00714975;SGB=-0.692562;RPB=0.0474056;MQB=0.000309792;MQSB=0.960591;BQB=0.981795;MQ0F=0;AC=1;AN=2;DP4=37,18,21,1;MQ=39 GT:PL:PS 0|1:73,0,255:3778 BrahChr1_FP1 4761 . T C 35.9011 . DP=76;VDB=0.0072334;SGB=-0.692067;RPB=0.069019;MQB=0.00134558;MQSB=0.986074;BQB=0.864733;MQ0F=0;AC=1;AN=2;DP4=37,18,20,0;MQ=38 GT:PL:PS 0|1:70,0,255:3778 BrahChr1_FP1 4769 . G C 151.4 . DP=51;VDB=1.81478e-05;SGB=-0.692562;RPB=0.220609;MQB=0.125019;MQSB=0.978623;BQB=0.951977;MQ0F=0;AC=1;AN=2;DP4=14,5,20,2;MQ=39 GT:PL:PS 0|1:184,0,236:3778 BrahChr1_FP1 4770 . C G 152.349 . DP=53;VDB=0.000355274;SGB=-0.692976;RPB=0.0566624;MQB=0.0606217;MQSB=0.987253;BQB=0.987912;MQ0F=0;AC=1;AN=2;DP4=9,5,24,2;MQ=39 GT:PL:PS 0|1:185,0,215:3778
There is no NON_REF at all.
LPhs_BrahFP_SV.vcf
Did I miss something?
Thanks.
Hi @OZTaekOppa
The result of using longphase phase
will only produce phased vcf file. Using longphase haplotag
will output phased bam. You can try bcftools consensus
(http://samtools.github.io/bcftools/bcftools.html#consensus) to generate fasta files for each haplotype. You can identify the interval of each block through the PS tag in the phased VCF.
Thanks.
Hi @twolinin,
Thank you for your reply.
According to the LongPhase, it generated these files.
-rw-rw---- 1 XXXX Q4840RW 95685361746 Jun 8 14:31 LPhs_BrahMP_Result.bam (from longphase haplotag) -rw-rw---- 1 XXXX Q4840RW 150895991 Jun 8 13:23 LPhs_BrahMP_SV.vcf (from longphase phase: PacBio long read) -rw-rw---- 1 XXXX Q4840RW 1879656315 Jun 8 13:23 LPhs_BrahMP.vcf (from longphase haplotag: Illumina PE reads)
And, the bcftools consensus requires a vcf file. Or, does it accept both vcf files (SNP and SV)?
bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa
And, I could use for dipplod option as the below; bcftools consensus -i -s NA001 -f in.fa in.vcf.gz 1 R > out1.fa (Primary) bcftools consensus -i -s NA001 -f in.fa in.vcf.gz 2 A > out2.fa (Alternative)
Which vcf files do I have to use, LPhs_BrahMP_SV.vcf or LPhs_BrahMP.vcf? And, what's the LPhs_BrahMP_Result.bam for?
Thanks.
Hi @OZTaekOppa
If you can only choose one I recommend using the phased SNP file (LPhs_BrahMP.vcf).
Longphase haplotag
will classify reads into different haplotypes according to phased vcf. In LPhs_BrahMP_Result.bam
you may find HP:i:1
or HP:i:2
which represents the haplotype this read belongs to.
Thanks.
Hi,
Can I use longphase to phase contigs and classify reads into different haplotypes, and then assemble each haplotype seperately ? How longphase essure that all HP:i:1 reads from different contigs represent the same haplotype?
Best, Kun
Hi @xiekunwhy
Because the SNP may be unevenly distributed in the genome, this will cause the same contig to be broken into multiple blocks in the phased result. You need to check the HP
and PS
tags at the same time to ensure that the reads are in the same haplotype.
Blcok 1
28364444-10b0-4ec5-bcc7-4a7d18a01bdc 0 chr1 10001 ... HP:i:2 PS:i:11772 PQ:i:40
24894e8b-27d1-4bb6-8f58-cc510c9d46ae 0 chr1 10001 ... HP:i:2 PS:i:11772 PQ:i:40
3b99b313-d5e6-4021-b8aa-a6aa86471c3c 0 chr1 10001 ... HP:i:1 PS:i:11772 PQ:i:9
...
244005c3-c048-42fd-9032-65c2df3b0026 0 chr1 54401 ... HP:i:1 PS:i:11772 PQ:i:40
561e3d07-d089-40c0-9ebe-236a17c851ff 0 chr1 58533 ... HP:i:2 PS:i:11772 PQ:i:40
19497736-fb34-4580-a349-032356e5fa7d 0 chr1 59179 ... HP:i:1 PS:i:11772 PQ:i:40
Blcok 2
2c65251c-389e-4e8c-973d-f78746ae018f 0 chr1 89154 ... HP:i:2 PS:i:89254 PQ:i:17
f91b5366-a6ae-4359-bae8-2284bebe4e45 16 chr1 89637 ... HP:i:2 PS:i:89254 PQ:i:8
34f96928-8003-4b7b-b8be-1bfcf54c4ed3 0 chr1 91896 ... HP:i:2 PS:i:89254 PQ:i:11
...
532b344b-df1d-4968-8183-b0e31319976e 16 chr1 180652 ... HP:i:1 PS:i:89254 PQ:i:40
c355a3aa-af20-4df2-b7ca-e969f39609f5 0 chr1 180749 ... HP:i:2 PS:i:89254 PQ:i:40
7a72c01f-2012-4f06-9e23-392e45f3cb7d 16 chr1 180751 ... HP:i:1 PS:i:89254 PQ:i:6
Thanks
Hi @twolinin ,
Thank you for your reply, but I think my problem was not solved yet.
Actually, I want to know if HP:i:1 in Blcok 1 and HP:i:1 in Blcok 2 and HP:i:1 in block N and HP:i:1 chr2 Blcok N represent the same haplotype ?
If all HP:i:1 in the whole genome represent the same haplotype, I can extract these reads and assemble them to get a haploid genome.
Best, Kun
This is what I want to do, hifiasm haplotypes are not good enough because there are large SVs between two haplotypes
No. HP:i:1/2 is only relevant within the same block/contig/Chr. It looks not a good idea to do this on the basis of a fragmented assembly (the pctg genome?). If there is a chromosome-level (better) assembly of the same species, you may consider using the related genome instead for phasing and reads partition.
Ok, I see, so I need to anchor contigs to chromosome first.
Hi,
Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use LongPhase for my genome projects.
FYI, Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14Kb) Assembled contigs: ~1,000 cotigs (N50 = ~3Mb) Chromosome-level scaffolds: 20 Chromosomes with Pore-C data (PromethION) Trio data: Not available
Given the circumstances, I would like to ask your professional opinions on which stage would be better to use LongPhase (including SNPs/SVs VCF files) 1) "Assembled contigs" or 2) "Chromosome-level scaffolds"?
Thank you in advance!