twolinin / longphase

GNU General Public License v3.0
96 stars 6 forks source link

Applying to contigs or chromosome-level scaffolds #13

Closed OZTaekOppa closed 2 months ago

OZTaekOppa commented 2 years ago

Hi,

Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use LongPhase for my genome projects.

FYI, Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14Kb) Assembled contigs: ~1,000 cotigs (N50 = ~3Mb) Chromosome-level scaffolds: 20 Chromosomes with Pore-C data (PromethION) Trio data: Not available

Given the circumstances, I would like to ask your professional opinions on which stage would be better to use LongPhase (including SNPs/SVs VCF files) 1) "Assembled contigs" or 2) "Chromosome-level scaffolds"?

Thank you in advance!

ythuang0522 commented 2 years ago

Hi @OZTaekOppa, I would suggest using the chromosome-level scaffolds for phasing if those are your finalized/stable assembly. As such the phased SNPs/SVs can be associated with the annotated genes in the same assembly/coordinate system.

OZTaekOppa commented 2 years ago

Thank you for your prompt reply. I will let you know after trying your suggestion. Cheers!

OZTaekOppa commented 2 years ago

I have another question. For SNPs/SVs, is there any option to include more than one vcf file? Let's say, I have three SNP VCFs from Illumina PE, PacBio and Oxford Nanopore. Can I include all three SNP VCF files? And, I have two SV VCFs from PacBio and Oxford Nanopore. Can I include these two SV VCF files? If not, which SNP and SV VCFs would be better for LongPhase? Any suggestion would be really appreciated! Cheers!

ythuang0522 commented 2 years ago

No. LongPhase takes only one VCF for SNPs and one for SV. It depends on which platform you trust most for SNP/SV calling and whether you prefer sensitivity or specificity. Below are just my personal comments.

For SNPs, you may check the PrecisionFDA challenges. if your PacBio refers to Hifi (not CLR), I would go with Hifi-called SNPs. If Dragen was used for Illumina-SNP calling, the Illumina VCF should also be good. The PrecisionFDA also suggested the combination of multiple platforms which you may consider.

For SVs, though there were few benchmarks, to my best knowledge, none were as reliable as the PrecisionFDA challenges. The only thing I am sure of is that the accuracy of SVs called by either platform is lower than that of SNPs. As such the SVs concordantly called by these two platforms would be my first choice (for specificity). If you only want either of them, ONT should call more SVs than PacBio yet with lower specificity. In the phasing stage, we would recommend using ONT long reads even if SNPs/SVs were called by Illumina or PacBio or whatever combination, as read length matters more than accuracy in phasing.

I hope this helps.

OZTaekOppa commented 2 years ago

Thanks for your valuable comments. Now, I have made all the necessary input files for the program. However, I am not sure which BAM file should be in (-b alignment.bam)? As you imagine, I have two bam files; one from Illumina (SNP call) and the other from PacBio/ONT (SV call). Many thanks all the time.

ythuang0522 commented 2 years ago

You have to use the long read bam. ONT would be better.

OZTaekOppa commented 2 years ago

Thanks for your reply all the time. I have tried as the instruction said. The first SNP and SV co-phasing looked ok because there was no error. The second haplotag command caused an issue. Please see the below error message.

phased SNP file: /ASM_Fet/Phased/LongPhase/LPhs_BrahFP.vcf phased SV file: /ASM_Fet/Phased/LongPhase/LPhs_BrahFP_SV.vcf input bam file: /ASM_Fet/cuteSV/aln_BrahFPONT.sorted.bam output bam file: LPhs_BrahFP_Result.bam number of threads: 8 write log file: false log file:

filter mapping quality below: 20 percentage threshold: 0.6 tag supplementary: false

parsing SNP VCF ... ERROR: not found PS type (Type=Integer or Type=String).

FYI, the SNP (Illumina PE reads) was called from BWA MEM and GATK v4 (without BaseRecalibration).

Did I miss something?

Any feedback would be really appreciated!

twolinin commented 2 years ago

Hi @OZTaekOppa,

Could you please check the VCF file for the following information? ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">

Can you provide the result of the following command

grep PS LPhs_BrahFP.vcf | head -n 10
grep PS LPhs_BrahFP_SV.vcf | head -n 10

Thanks

OZTaekOppa commented 2 years ago

Thanks for your prompt reply!

Re format,

FORMAT=

Re grep PS LPhs_BrahFP.vcf | head -n 10, this file was generated from Illumina PE reads with GATK.

FORMAT=

GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output /nvme/pbs/tmpdir/pbs.235932.flashmgr2/marked_Brah_FetPEDup_RG_wo.gvcf.gz --input /nvme/pbs/tmpdir/pbs.235932.flashmgr2/marked_Brah_FetPEDup.rg.bam --reference /nvme/pbs/tmpdir/pbs.235932.flashmgr2/BrahN2_FetPwM.fasta --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false --mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --pileup-detection false --pileup-detection-enable-indel-pileup-calling false --num-artificial-haplotypes-to-add-per-allele 5 --artifical-haplotype-filtering-kmer-size 10 --pileup-detection-snp-alt-threshold 0.1 --pileup-detection-indel-alt-threshold 0.5 --pileup-detection-absolute-alt-depth 0.0 --pileup-detection-snp-adjacent-to-assembled-indel-range 5 --pileup-detection-bad-read-tolerance 0.0 --pileup-detection-proper-pair-read-badness true --pileup-detection-edit-distance-read-badness-threshold 0.08 --pileup-detection-chimeric-read-badness true --pileup-detection-template-mean-badness-threshold 0.0 --pileup-detection-template-std-badness-threshold 0.0 --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --smith-waterman-dangling-end-match-value 25 --smith-waterman-dangling-end-mismatch-penalty -50 --smith-waterman-dangling-end-gap-open-penalty -110 --smith-waterman-dangling-end-gap-extend-penalty -6 --smith-waterman-haplotype-to-reference-match-value 200 --smith-waterman-haplotype-to-reference-mismatch-penalty -150 --smith-waterman-haplotype-to-reference-gap-open-penalty -260 --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 --smith-waterman-read-to-haplotype-match-value 10 --smith-waterman-read-to-haplotype-mismatch-penalty -15 --smith-waterman-read-to-haplotype-gap-open-penalty -30 --smith-waterman-read-to-haplotype-gap-extend-penalty -5 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.2.6.1-18-g72684d0-SNAPSHOT",Date="19 May 2022 8:22:22 PM">

BrahChr1_FP1 1 BrahChr1_FP1_1_A A . . END=187 GT:DP:GQ:MIN_DP:PL:PS 0/0:0:0:0:0,0,0:. BrahChr1_FP1 188 BrahChr1_FP1_188_G G . . END=189 GT:DP:GQ:MIN_DP:PL:PS 0/0:2:6:2:0,6,84:. BrahChr1_FP1 190 BrahChr1_FP1_190_A A . . END=197 GT:DP:GQ:MIN_DP:PL:PS 0/0:3:9:3:0,9,125:. BrahChr1_FP1 198 BrahChr1_FP1_198_C C T, 69.83 . DP=3;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;RAW_MQandDP=1601,3 GT:AD:DP:GQ:PL:SB:PS 1/1:0,3,0:3:9:83,9,0,83,9,83:0,0,1,2:. BrahChr1_FP1 199 BrahChr1_FP1_199_C C . . END=258 GT:DP:GQ:MIN_DP:PL:PS 0/0:3:9:3:0,9,125:. BrahChr1_FP1 259 BrahChr1_FP1_259_A A . . END=267 GT:DP:GQ:MIN_DP:PL:PS 0/0:4:12:4:0,12,167:. BrahChr1_FP1 268 BrahChr1_FP1_268_C C . . END=268 GT:DP:GQ:MIN_DP:PL:PS 0/0:4:0:4:0,0,84:. BrahChr1_FP1 269 BrahChr1_FP1_269_A A . . END=269 GT:DP:GQ:MIN_DP:PL:PS 0/0:4:12:4:0,12,167:.

Re grep PS LPhs_BrahFP_SV.vcf | head -n 10, this file was generated from long-reads with cuteSV.

FORMAT=

BrahChr10_FP1 2 cuteSV.BND.0 N [BrahChr16_FP1:0[N 105.0 PASS PRECISE;SVTYPE=BND;RE=11;RNAMES=BrahFetONT_13742112,BrahFetONT_10953870,BrahFetONT_3103665,BrahFetONT_13596645,BrahFetONT_8832824,BrahFetONT_8218586,BrahFetONT_8571551,BrahFetONT_9149973,BrahFetONT_6733348,BrahFetONT_13695233,BrahFetONT_12979073;AF=1.0 GT:DR:DV:PL:GQ:PS 1/1:0:11:105,28,0:28:. BrahChr10_FP1 2313 cuteSV.INS.0 G GGGATCAGCCACATGATCACTGTCGTTCAGTCAGGTGCTCAGTTCT 8.7 PASS PRECISE;SVTYPE=INS;SVLEN=45;END=2313;CIPOS=-82,82;CILEN=-5,5;RE=7;RNAMES=BrahFetONT_7758083,BrahFetONT_10704010,BrahFetONT_7167530,BrahFetONT_10344073,BrahFetONT_13963964,BrahFetONT_10850772,BrahFetONT_3098190;AF=0.3043 GT:DR:DV:PL:GQ:PS 0/1:16:7:9,1,95:8:. BrahChr10_FP1 3588 cuteSV.INS.1 C CCGCCCATCTGCATGACAATTGCCAGGTGAGCTGCGTGATTCTGTCGGCTCATCTGCTTGACAATCCCAGCTGAGCTGC 187.3 PASS PRECISE;SVTYPE=INS;SVLEN=78;END=3588;CIPOS=-30,30;CILEN=-9,9;RE=31;RNAMES=BrahFetONT_5393058,BrahFetONT_3796473,BrahFetONT_4060562,BrahFetONT_4983571,BrahFetONT_9417613,BrahFetONT_3825902,BrahFetONT_10704010,BrahFetONT_12109135,BrahFetONT_10344073,BrahFetONT_2702150,BrahFetONT_3971571,BrahFetONT_13695233,BrahFetONT_9298532,BrahFetONT_6853143,BrahFetONT_12188515,BrahFetONT_3682276,BrahFetONT_6993887,BrahFetONT_13704695,BrahFetONT_8149995,BrahFetONT_1066796,BrahFetONT_7758083,BrahFetONT_8976138,BrahFetONT_12408260,BrahFetONT_13963964,BrahFetONT_5635706,BrahFetONT_7167530,BrahFetONT_11104630,BrahFetONT_14062239,BrahFetONT_6733348,BrahFetONT_13127513,BrahFetONT_6158707;AF=0.7209 GT:DR:DV:PL:GQ:PS 0/1:12:31:187,1,6:5:. BrahChr10_FP1 4460 cuteSV.DEL.0 CATGATCAGCCACGTGATCACTGACTGCACAGTCACGTGATCAGTGCCTCTCAGTCACGTGATCACCGTATGATAAACACGTGATCAGTGAACG C 0.1 q5 IMPRECISE;SVTYPE=DEL;SVLEN=-93;END=4553;CIPOS=-12,12;CILEN=-6,6;RE=9;RNAMES=BrahFetONT_6807993,BrahFetONT_10344073,BrahFetONT_13971798,BrahFetONT_1608376,BrahFetONT_2702150,BrahFetONT_3796473,BrahFetONT_3825902,BrahFetONT_12209070,BrahFetONT_13704695;AF=0.2195;STRAND=+- GT:DR:DV:PL:GQ:PS 0/0:32:9:0,19,220:18:. BrahChr10_FP1 4574 cuteSV.INS.2 A AATGTTCAGTCAGGTGGTTAGTTCCTGATCAGCCACGGAATCAGT 0.1 q5 IMPRECISE;SVTYPE=INS;SVLEN=44;END=4574;CIPOS=-24,24;CILEN=-4,4;RE=6;RNAMES=BrahFetONT_5183780,BrahFetONT_2283900,BrahFetONT_1514757,BrahFetONT_7335015,BrahFetONT_4766498,BrahFetONT_3135862;AF=0.2 GT:DR:DV:PL:GQ:PS 0/0:24:6:0,19,172:19:. BrahChr10_FP1 4643 cuteSV.INS.3 C CGTCCCATGCTCAGAGCCTGTCAGTCACCTGATCTGCCTGATAAACAGTGATCAGTGCGAACAGCACGTCATCAGTGCATGCGCAGTCATGATGAGTGCATGACAGGTGATCAGTGTGTAC 0.0 q5 IMPRECISE;SVTYPE=INS;SVLEN=120;END=4643;CIPOS=-24,24;CILEN=-12,12;RE=5;RNAMES=BrahFetONT_9532622,BrahFetONT_11296644,BrahFetONT_7463218,BrahFetONT_13695233,BrahFetONT_6733348;AF=0.1724 GT:DR:DV:PL:GQ:PS 0/0:24:5:0,26,181:26:. BrahChr10_FP1 4800 cuteSV.DEL.1 TTGATCAGCCACGTGATCAGTGCCTGCAGAGTCACGTGATCAGGGCCTG T 0.0 q5 IMPRECISE;SVTYPE=DEL;SVLEN=-48;END=4848;CIPOS=-42,42;CILEN=-3,3;RE=14;RNAMES=BrahFetONT_10850772,BrahFetONT_14062239,BrahFetONT_3135862,BrahFetONT_1066796,BrahFetONT_1002696,BrahFetONT_4766498,BrahFetONT_11423571,BrahFetONT_5393058,BrahFetONT_7028860,BrahFetONT_4931796,BrahFetONT_3682276,BrahFetONT_3971571,BrahFetONT_13421098,BrahFetONT_8976138;AF=0.2333;STRAND=+- GT:DR:DV:PL:GQ:PS 0/0:46:14:0,20,305:19:. BrahChr10_FP1 4891 cuteSV.INS.4 C CTCATTGCCAGCTGAGCTTGGGATTCTGCGACTCATTTAGCCAAG 118.6 PASS PRECISE;SVTYPE=INS;SVLEN=44;END=4891;CIPOS=-15,15;CILEN=-1,1;RE=25;RNAMES=BrahFetONT_2567755,BrahFetONT_8571551,BrahFetONT_5876644,BrahFetONT_12209070,BrahFetONT_9149973,BrahFetONT_6158707,BrahFetONT_10704010,BrahFetONT_6807993,BrahFetONT_4983571,BrahFetONT_9324359,BrahFetONT_7028860,BrahFetONT_13971798,BrahFetONT_3825902,BrahFetONT_10356213,BrahFetONT_523643,BrahFetONT_9334000,BrahFetONT_3796473,BrahFetONT_1608376,BrahFetONT_14062239,BrahFetONT_3971571,BrahFetONT_11672843,BrahFetONT_4060562,BrahFetONT_2702150,BrahFetONT_685185,BrahFetONT_10344073;AF=0.5319 GT:DR:DV:PL:GQ:PS 0/1:22:25:119,0,90:89:. BrahChr10_FP1 5882 cuteSV.INS.5 C CCCACATGATCACTGCATGTTCAGTTCCTGATCATCAGCCACCTGATCACTGCATGTTCAGGTGGTTAG 458.8 PASS PRECISE;SVTYPE=INS;SVLEN=68;END=5882;CIPOS=-34,34;CILEN=-5,5;RE=75;RNAMES=BrahFetONT_7758083,BrahFetONT_6292265,BrahFetONT_1514757,BrahFetONT_9324359,BrahFetONT_10850772,BrahFetONT_5876644,BrahFetONT_523643,BrahFetONT_13963964,BrahFetONT_10704010,BrahFetONT_2300906,BrahFetONT_12986704,BrahFetONT_2747967,BrahFetONT_3213196,BrahFetONT_4391037,BrahFetONT_3054294,BrahFetONT_13848751,BrahFetONT_1043954,BrahFetONT_8774927,BrahFetONT_6091667,BrahFetONT_7167530,BrahFetONT_4434546,BrahFetONT_7410833,BrahFetONT_4983571,BrahFetONT_7317771,BrahFetONT_7028860,BrahFetONT_1010531,BrahFetONT_11100551,BrahFetONT_14062239,BrahFetONT_6157363,BrahFetONT_13695233,BrahFetONT_2702150,BrahFetONT_5982345,BrahFetONT_2942672,BrahFetONT_3971571,BrahFetONT_2375602,BrahFetONT_10344073,BrahFetONT_10977499,BrahFetONT_12209070,BrahFetONT_1608376,BrahFetONT_11672843,BrahFetONT_6158707,BrahFetONT_11251423,BrahFetONT_12762111,BrahFetONT_9334000,BrahFetONT_11992769,BrahFetONT_7580884,BrahFetONT_8395881,BrahFetONT_5393058,BrahFetONT_13379399,BrahFetONT_6312644,BrahFetONT_6807993,BrahFetONT_11094384,BrahFetONT_3796473,BrahFetONT_618191,BrahFetONT_13125102,BrahFetONT_13127513,BrahFetONT_6993887,BrahFetONT_2948035,BrahFetONT_5813126,BrahFetONT_13971798,BrahFetONT_3825902,BrahFetONT_1498197,BrahFetONT_10356213,BrahFetONT_4481979,BrahFetONT_8987000,BrahFetONT_1357018,BrahFetONT_7919309,BrahFetONT_7335015,BrahFetONT_11073076,BrahFetONT_5769150,BrahFetONT_1932079,BrahFetONT_13829341,BrahFetONT_922519,BrahFetONT_4766498,BrahFetONT_6798603;AF=0.7353 GT:DR:DV:PL:GQ:PS 1/1:27:75:459,8,1:7:.

After reading again the manual, I have followed the exact instruction for long-read SV but not for short-read SNP. However, there was no explanation for calling SNPs using Illumina PE reads. The manual only explains to generate an SNP file using PEPPER-Margin-DeepVariant and Clair3 which are for long-reads. It is required to call the SNP from the long-reads?

Any suggestion would be really helpful!

Cheers

twolinin commented 2 years ago

Hi @OZTaekOppa

Maybe the part of extracting PS information was not handled properly. Could you please remove "##GATKCommandLine=..." and execute the longphase again?

Thanks

OZTaekOppa commented 2 years ago

Hi @twolinin,

I have tried again.

Executed script:

Step 1: SNP and Sv co-phasing

/Extra_Programs/longphase/longphase phase \

-s ASM_Mom/SNPCalls/GATK/marked_Brah_MomTPEDup_RGID_wo3.vcf \

--sv-file /ASM_Mom/cuteSV/PBONT_Mrg/BrahMPOT_cuteSVLPV9.vcf \

-b /ASM_Mom/cuteSV/PBONT_Mrg/BrahMPPOTV9.sorted.bam \

-r /ASM_Mom/HapG/hapog_results/BrahN2_MomPwM.fasta \

-t 12 \

-o LPhs_BrahMP \

--pb

This script generated two files. -rw-rw---- 1 XXXX Q4840RW 150566273 Jun 4 11:52 LPhs_BrahMP_SV.vcf -rw-rw---- 1 XXXX Q4840RW 13801240011 Jun 4 11:52 LPhs_BrahMP.vcf

Step 2: Haplotag command

/Extra_Programs/longphase/longphase haplotag \ -s /ASM_Mom/Phased/LongPhase/LPhs_BrahMP.vcf \ --sv-file /ASM_Mom/Phased/LongPhase/LPhs_BrahMP_SV.vcf \ -b /ASM_Mom/cuteSV/PBONT_Mrg/BrahMPPOTV9.sorted.bam \ -t 12 \ -o LPhs_BrahMP_Result

This script generated a single file. -rw-rw---- 1 XXXX Q4840RW 95535357165 Jun 4 12:46 LPhs_BrahMP_Result.bam

After these two steps, the log file shows below.

############################################################################### phased SNP file: /ASM_Mom/Phased/LongPhase/LPhs_BrahMP.vcf phased SV file: /ASM_Mom/Phased/LongPhase/LPhs_BrahMP_SV.vcf input bam file: /ASM_Mom/cuteSV/PBONT_Mrg/BrahMPPOTV9.sorted.bam output bam file: LPhs_BrahMP_Result.bam number of threads: 12 write log file: false log file:

filter mapping quality below: 20 percentage threshold: 0.6 tag supplementary: false

parsing SNP VCF ... 256s parsing SV VCF ... 0s tag read start ... chr: BrahChr1_MP1 ... 146s chr: BrahChr2_MP1 ... 119s chr: BrahChr3_M1 ... 107s chr: BrahChr4_MP1 ... 110s chr: BrahChr5_MP1 ... 107s chr: BrahChr6_MP1 ... 106s chr: BrahChr7_MP1 ... 100s chr: BrahChr8_MP1 ... 125s chr: BrahChr9_MP1 ... 211s chr: BrahChr10_MP1 ... 86s chr: BrahChr11_MP1 ... 86s chr: BrahChr12_MP1 ... 71s chr: BrahChr13_MP1 ... 68s chr: BrahChr14_MP1 ... 69s chr: BrahChr15_MP1 ... 70s chr: BrahChr16_MP1 ... 65s chr: BrahChr17_MP1 ... 60s chr: BrahChr18_MP1 ... 103s chr: BrahChr19_MP1 ... 58s chr: BrahChr20_MP1 ... 59s chr: BrahChr21_MP1 ... 66s chr: BrahChr22_MP1 ... 50s chr: BrahChr23_MP1 ... 46s chr: BrahChr24_MP1 ... 51s chr: BrahChr25_MP1 ... 37s chr: BrahChr26_MP1 ... 42s chr: BrahChr27_MP1 ... 40s chr: BrahChr28_MP1 ... 70s chr: BrahChr29_MP1 ... 50s chr: BrahChrX_MP1 ... 137s chr: MtGnmP ... 4s tag read 2520s

total process time: 2776s total alignment: 42631097 total supplementary: 2515368 total secondary: 0 total unmapped: 0 total tag alignment: 0 total untagged: 24564268 ########################### Job Execution History ############################# JobId:243875.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahMP_LPhs SessionId:8122 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=336:00:00 ResourcesUsed:cpupercent=850,cput=05:47:39,mem=382964kb,ncpus=12,vmem=1580208kb,walltime=00:46:32 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################

Is this the final output? No phased fasta file (e.g. primary and/or alternative)?? Do I need to run different programs to generate the final phased fasta files?

Thanks.

twolinin commented 2 years ago

Hi @OZTaekOppa

I noticed that the ALT field of LPhs_BrahFP.vcf is . LongPhase expects the ALT field of heterozygous SNPs to be non-REF alleles, specifically one of the letters of A, T, C, G instead of .

BrahChr1_FP1 1 BrahChr1_FP1_1_A A <NON_REF> . . END=187 GT:DP:GQ:MIN_DP:PL:PS 0/0:0:0:0:0,0,0:.
BrahChr1_FP1 188 BrahChr1_FP1_188_G G <NON_REF> . . END=189 GT:DP:GQ:MIN_DP:PL:PS 0/0:2:6:2:0,6,84:.
BrahChr1_FP1 190 BrahChr1_FP1_190_A A <NON_REF> . . END=197 GT:DP:GQ:MIN_DP:PL:PS 0/0:3:9:3:0,9,125:.

I want to check if phased SNP exists in LPhs_BrahMP.vcf after phasing. grep "|" LPhs_BrahMP.vcf | wc -l If the number is 0 I want to check the information in the ALT field of heterozygous SNPs. grep "0/1" LPhs_BrahMP.vcf | head -n 5

Thanks.

OZTaekOppa commented 2 years ago

Hi @twolinin,

I have checked again.

XXXX@flashlite1:.../ASM_Mom/Phased/LongPhase> grep "|" LPhs_BrahMP.vcf | wc -l 3612187

XXXX@flashlite1:.../ASM_Mom/Phased/LongPhase> grep "0/1" LPhs_BrahMP.vcf | head -n 5 BrahChr1_MP1 93 BrahChr1_MP1_93_C C T, 4.66 . BaseQRankSum=-2.275;DP=505;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.963;RAW_MQandDP=711891,505;ReadPosRankSum=-2.322 GT:AD:DP:GQ:PL:SB:PS 0/1:116,14,0:130:12:12,0,3571,359,3611,3970:59,57,6,8:. BrahChr1_MP1 146 BrahChr1_MP1_146_T T C, 3136.64 . BaseQRankSum=0.145;DP=1023;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.352;RAW_MQandDP=1069663,1023;ReadPosRankSum=3.644 GT:AD:DP:GQ:PGT:PID:PL:SB:PS 0/1:69,136,0:205:99:0|1:146_T_C:3144,0,1678,3346,2072,5418:50,19,49,87:. BrahChr1_MP1 156 BrahChr1_MP1_156_C C G, 6574.64 . BaseQRankSum=2.072;DP=1040;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-0.27;RAW_MQandDP=1080891,1040;ReadPosRankSum=5.53 GT:AD:DP:GQ:PGT:PID:PL:SB:PS 0/1:43,176,0:219:99:0|1:146_T_C:6582,0,706,6705,1232,7937:33,10,68,108:. BrahChr1_MP1 162 BrahChr1_MP1_162_C C G, 6591.64 . BaseQRankSum=3.335;DP=1031;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0.121;RAW_MQandDP=1066721,1031;ReadPosRankSum=6.378 GT:AD:DP:GQ:PGT:PID:PL:SB:PS 0/1:42,176,0:218:99:0|1:146_T_C:6599,0,734,6720,1260,7980:32,10,67,109:. BrahChr1_MP1 660 BrahChr1_MP1_660_G G C, 418.64 . BaseQRankSum=-0.943;DP=621;ExcessHet=0;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-1.449;RAW_MQandDP=1390168,621;ReadPosRankSum=3.329 GT:AD:DP:GQ:PL:SB:PS 0/1:232,28,0:260:99:426,0,8530,1120,8614,9734:137,95,15,13:.

FYI, I have used GATK 4.2 for calling SNPs (Illumina PE reads).

Thanks.

twolinin commented 2 years ago

Hi @OZTaekOppa

I guess the problem is the ALT field. You can remove the " ," in the VCF by command below and execute the longphase phase again.

sed 's/,<NON_REF>//g' your_unPhasedSNP.vcf

If the problem is still not solved, you can provide some BAM and VCF to help me find the problem faster.

Thanks.

OZTaekOppa commented 2 years ago

Hi @twolinin,

Thanks. That's what I thought. Some reason why the GATK generated the for calling SNPs. This is not the first time. So, I am testing with BCFtools and FreeBayes for calling SNPs again. And, I will try your suggestion too.

Thanks.

OZTaekOppa commented 2 years ago

Hi @twolinin,

I have tested again with BCFtools (SNP.vcf). However, I do not see any final phased fasta file at all.

/Extra_Programs/longphase/longphase phase \ -s /ASM_Fet/SNPCalls/BCFtools/marked_Brah_FetTPEDupRG_SNP2_var.flt.vcf \ --> SNP.vcf called from BCFtools --sv-file /ASM_Fet/cuteSV/BrahFPONT_cuteSVLPV9.vcf \ --> SV.vcf called from cuteSV -b /ASM_Fet/cuteSV/BrahFPONTV9.sorted.bam \ --> sorted.bam called from cuteSV -r /ASM_Fet/HapG/hapog_results/BrahN2_FetPwM.fasta \ -t 12 \ -o LPhs_BrahFP \ --ont \

This generated the below outcomes without any errors.

-rw-rw---- 1 XXXX Q4840RW 63819454 Jun 8 13:00 LPhs_BrahFP_SV.vcf -rw-rw---- 1 XXXX Q4840RW 1933444110 Jun 8 12:59 LPhs_BrahFP.vcf

############################################################################### parsing VCF ... 22s parsing SV VCF ... 0s reading reference ... 13s parsing contig/chromosome: BrahChr1_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 59s parsing contig/chromosome: BrahChr2_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 48s parsing contig/chromosome: BrahChr3_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 45s parsing contig/chromosome: BrahChr4_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 51s parsing contig/chromosome: BrahChr5_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 41s parsing contig/chromosome: BrahChr6_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 42s parsing contig/chromosome: BrahChr7_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 39s parsing contig/chromosome: BrahChr8_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 45s parsing contig/chromosome: BrahChr9_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 62s parsing contig/chromosome: BrahChr10_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 37s parsing contig/chromosome: BrahChr11_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 34s parsing contig/chromosome: BrahChr12_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 35s parsing contig/chromosome: BrahChr13_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 29s parsing contig/chromosome: BrahChr14_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 30s parsing contig/chromosome: BrahChr15_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 36s parsing contig/chromosome: BrahChr16_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 29s parsing contig/chromosome: BrahChr17_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 27s parsing contig/chromosome: BrahChr18_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 32s parsing contig/chromosome: BrahChr19_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 24s parsing contig/chromosome: BrahChr20_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 27s parsing contig/chromosome: BrahChr21_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 28s parsing contig/chromosome: BrahChr22_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 21s parsing contig/chromosome: BrahChr23_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 21s parsing contig/chromosome: BrahChr24_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 24s parsing contig/chromosome: BrahChr25_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 17s parsing contig/chromosome: BrahChr26_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 16s parsing contig/chromosome: BrahChr27_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 15s parsing contig/chromosome: BrahChr28_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 22s parsing contig/chromosome: BrahChr29_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 23s parsing contig/chromosome: BrahChrX_FP1 ... fetch SNP ... filter SNP ... run algorithm ... 23s parsing contig/chromosome: MtGnmP ... skip writeResult ... 66s write SV Result ... 4s ########################### Job Execution History ############################# JobId:244611.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahFP_LPhs SessionId:23273 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=22:00:00 ResourcesUsed:cpupercent=175,cput=00:30:33,mem=41511928kb,ncpus=12,vmem=42719656kb,walltime=00:18:28 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################

/Extra_Programs/longphase/longphase haplotag \ -s /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP.vcf \ --> phased SNP.vcf called from longphase --sv-file /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP_SV.vcf \ --> phased SV.vcf called from longphase -b /ASM_Fet/cuteSV/BrahFPONTV9.sorted.bam \ --> sorted.bam called from cuteSV -t 12 \ -o LPhs_BrahFP_Result \

This generated the below outcomes without any errors.

-rw-rw---- 1 XXXX Q4840RW 31954304760 Jun 8 13:21 LPhs_BrahFP_Result.bam

############################################################################### phased SNP file: /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP.vcf phased SV file: /ASM_Fet/Phased/LongPhase/BCFSNP/LPhs_BrahFP_SV.vcf input bam file: /ASM_Fet/cuteSV/BrahFPONTV9.sorted.bam output bam file: LPhs_BrahFP_Result.bam number of threads: 12 write log file: false log file:

filter mapping quality below: 20 percentage threshold: 0.6 tag supplementary: false

parsing SNP VCF ... 40s parsing SV VCF ... 1s tag read start ... chr: BrahChr1_FP1 ... 51s chr: BrahChr2_FP1 ... 37s chr: BrahChr3_FP1 ... 32s chr: BrahChr4_FP1 ... 33s chr: BrahChr5_FP1 ... 33s chr: BrahChr6_FP1 ... 32s chr: BrahChr7_FP1 ... 30s chr: BrahChr8_FP1 ... 43s chr: BrahChr9_FP1 ... 94s chr: BrahChr10_FP1 ... 28s chr: BrahChr11_FP1 ... 27s chr: BrahChr12_FP1 ... 25s chr: BrahChr13_FP1 ... 22s chr: BrahChr14_FP1 ... 22s chr: BrahChr15_FP1 ... 25s chr: BrahChr16_FP1 ... 22s chr: BrahChr17_FP1 ... 20s chr: BrahChr18_FP1 ... 43s chr: BrahChr19_FP1 ... 20s chr: BrahChr20_FP1 ... 19s chr: BrahChr21_FP1 ... 21s chr: BrahChr22_FP1 ... 17s chr: BrahChr23_FP1 ... 16s chr: BrahChr24_FP1 ... 17s chr: BrahChr25_FP1 ... 11s chr: BrahChr26_FP1 ... 14s chr: BrahChr27_FP1 ... 13s chr: BrahChr28_FP1 ... 34s chr: BrahChr29_FP1 ... 17s chr: BrahChrX_FP1 ... 28s chr: MtGnmP ... 0s tag read 846s

total process time: 887s total alignment: 21029397 total supplementary: 1319137 total secondary: 0 total unmapped: 0 total tag alignment: 8762479 total untagged: 3542710 ########################### Job Execution History ############################# JobId:244613.flashmgr2 UserName:XXXX GroupName:qris-uq JobName:BrahFP_LPhs SessionId:8794 ResourcesRequested:mem=200gb,ncpus=12,place=free,walltime=158:00:00 ResourcesUsed:cpupercent=752,cput=01:53:21,mem=3252340kb,ncpus=12,vmem=4452288kb,walltime=00:15:08 QueueUsed:General AccountString:qris-uq ExitStatus:0 ###############################################################################

According to the second part command (longphase haplotag: -o, --out-prefix=NAME prefix of phasing result. default:result), I expected to see the final phased fasta files (in the result folder) but "Phs_BrahFP_Result.bam."

I have even checked both phased SNP and SV.vcf files.

LPhs_BrahFP.vcf

FORMAT=

FORMAT=

INFO=

INFO=

INFO=

INFO=

bcftools_callVersion=1.12+htslib-1.12

bcftools_callCommand=call -mv -Ob -o marked_Brah_FetTPEDupRG_SNP2_var.raw.bcf; Date=Tue Jun 7 11:20:44 2022

bcftools_viewVersion=1.12+htslib-1.12

bcftools_viewCommand=view -i %QUAL>=20 marked_Brah_FetTPEDupRG_SNP2_var.raw.bcf; Date=Wed Jun 8 09:12:10 2022

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BrahFetAll

BrahChr1_FP1 1049 . ATCAGCCAAATGATCAGTGCATGCAGTC A 30.3007 . INDEL;IDV=15;IMF=0.205479;DP=73;VDB=0.767633;SGB=-0.670168;MQSB=0.0901416;MQ0F=0;AC=1;AN=2;DP4=45,18,9,1;MQ=35 GT:PL:PS 0/1:64,0,185:. BrahChr1_FP1 3778 . G A 143.932 . DP=95;VDB=0.136722;SGB=-0.692067;RPB=0.000122914;MQB=0.947213;MQSB=0.959302;BQB=0.94205;MQ0F=0;AC=1;AN=2;DP4=31,31,13,7;MQ=43 GT:PL:PS 1|0:178,0,255:3778 BrahChr1_FP1 4759 . C T 39.0226 . DP=79;VDB=0.00714975;SGB=-0.692562;RPB=0.0474056;MQB=0.000309792;MQSB=0.960591;BQB=0.981795;MQ0F=0;AC=1;AN=2;DP4=37,18,21,1;MQ=39 GT:PL:PS 0|1:73,0,255:3778 BrahChr1_FP1 4761 . T C 35.9011 . DP=76;VDB=0.0072334;SGB=-0.692067;RPB=0.069019;MQB=0.00134558;MQSB=0.986074;BQB=0.864733;MQ0F=0;AC=1;AN=2;DP4=37,18,20,0;MQ=38 GT:PL:PS 0|1:70,0,255:3778 BrahChr1_FP1 4769 . G C 151.4 . DP=51;VDB=1.81478e-05;SGB=-0.692562;RPB=0.220609;MQB=0.125019;MQSB=0.978623;BQB=0.951977;MQ0F=0;AC=1;AN=2;DP4=14,5,20,2;MQ=39 GT:PL:PS 0|1:184,0,236:3778 BrahChr1_FP1 4770 . C G 152.349 . DP=53;VDB=0.000355274;SGB=-0.692976;RPB=0.0566624;MQB=0.0606217;MQSB=0.987253;BQB=0.987912;MQ0F=0;AC=1;AN=2;DP4=9,5,24,2;MQ=39 GT:PL:PS 0|1:185,0,215:3778

There is no NON_REF at all.

LPhs_BrahFP_SV.vcf

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CommandLine="cuteSV BrahFPONTV9.sorted.bam /ASM_Fet/HapG/hapog_results/BrahN2_FetPwM.fasta BrahFPONT_cuteSVLPV9.vcf /ASM_Fet/cuteSV/ --threads 8 --report_readid --genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3"

FORMAT=

Did I miss something?

Thanks.

twolinin commented 2 years ago

Hi @OZTaekOppa

The result of using longphase phase will only produce phased vcf file. Using longphase haplotag will output phased bam. You can try bcftools consensus (http://samtools.github.io/bcftools/bcftools.html#consensus) to generate fasta files for each haplotype. You can identify the interval of each block through the PS tag in the phased VCF.

Thanks.

OZTaekOppa commented 2 years ago

Hi @twolinin,

Thank you for your reply.

According to the LongPhase, it generated these files.

-rw-rw---- 1 XXXX Q4840RW 95685361746 Jun 8 14:31 LPhs_BrahMP_Result.bam (from longphase haplotag) -rw-rw---- 1 XXXX Q4840RW 150895991 Jun 8 13:23 LPhs_BrahMP_SV.vcf (from longphase phase: PacBio long read) -rw-rw---- 1 XXXX Q4840RW 1879656315 Jun 8 13:23 LPhs_BrahMP.vcf (from longphase haplotag: Illumina PE reads)

And, the bcftools consensus requires a vcf file. Or, does it accept both vcf files (SNP and SV)?

bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa

And, I could use for dipplod option as the below; bcftools consensus -i -s NA001 -f in.fa in.vcf.gz 1 R > out1.fa (Primary) bcftools consensus -i -s NA001 -f in.fa in.vcf.gz 2 A > out2.fa (Alternative)

Which vcf files do I have to use, LPhs_BrahMP_SV.vcf or LPhs_BrahMP.vcf? And, what's the LPhs_BrahMP_Result.bam for?

Thanks.

twolinin commented 2 years ago

Hi @OZTaekOppa

If you can only choose one I recommend using the phased SNP file (LPhs_BrahMP.vcf).

Longphase haplotag will classify reads into different haplotypes according to phased vcf. In LPhs_BrahMP_Result.bam you may find HP:i:1 or HP:i:2 which represents the haplotype this read belongs to.

Thanks.

xiekunwhy commented 1 year ago

Hi,

Can I use longphase to phase contigs and classify reads into different haplotypes, and then assemble each haplotype seperately ? How longphase essure that all HP:i:1 reads from different contigs represent the same haplotype?

Best, Kun

twolinin commented 1 year ago

Hi @xiekunwhy

Because the SNP may be unevenly distributed in the genome, this will cause the same contig to be broken into multiple blocks in the phased result. You need to check the HP and PS tags at the same time to ensure that the reads are in the same haplotype.

Blcok 1

28364444-10b0-4ec5-bcc7-4a7d18a01bdc    0       chr1    10001   ...     HP:i:2  PS:i:11772      PQ:i:40
24894e8b-27d1-4bb6-8f58-cc510c9d46ae    0       chr1    10001   ...     HP:i:2  PS:i:11772      PQ:i:40
3b99b313-d5e6-4021-b8aa-a6aa86471c3c    0       chr1    10001   ...     HP:i:1  PS:i:11772      PQ:i:9
...
244005c3-c048-42fd-9032-65c2df3b0026    0       chr1    54401   ...     HP:i:1  PS:i:11772      PQ:i:40
561e3d07-d089-40c0-9ebe-236a17c851ff    0       chr1    58533   ...     HP:i:2  PS:i:11772      PQ:i:40
19497736-fb34-4580-a349-032356e5fa7d    0       chr1    59179   ...     HP:i:1  PS:i:11772      PQ:i:40

Blcok 2

2c65251c-389e-4e8c-973d-f78746ae018f    0       chr1    89154   ...     HP:i:2  PS:i:89254      PQ:i:17
f91b5366-a6ae-4359-bae8-2284bebe4e45    16      chr1    89637   ...     HP:i:2  PS:i:89254      PQ:i:8
34f96928-8003-4b7b-b8be-1bfcf54c4ed3    0       chr1    91896   ...     HP:i:2  PS:i:89254      PQ:i:11
...
532b344b-df1d-4968-8183-b0e31319976e    16      chr1    180652  ...     HP:i:1  PS:i:89254      PQ:i:40
c355a3aa-af20-4df2-b7ca-e969f39609f5    0       chr1    180749  ...     HP:i:2  PS:i:89254      PQ:i:40
7a72c01f-2012-4f06-9e23-392e45f3cb7d    16      chr1    180751  ...     HP:i:1  PS:i:89254      PQ:i:6

Thanks

xiekunwhy commented 1 year ago

Hi @twolinin ,

Thank you for your reply, but I think my problem was not solved yet.

Actually, I want to know if HP:i:1 in Blcok 1 and HP:i:1 in Blcok 2 and HP:i:1 in block N and HP:i:1 chr2 Blcok N represent the same haplotype ?

If all HP:i:1 in the whole genome represent the same haplotype, I can extract these reads and assemble them to get a haploid genome.

Best, Kun

xiekunwhy commented 1 year ago

This is what I want to do, hifiasm haplotypes are not good enough because there are large SVs between two haplotypes

image

ythuang0522 commented 1 year ago

No. HP:i:1/2 is only relevant within the same block/contig/Chr. It looks not a good idea to do this on the basis of a fragmented assembly (the pctg genome?). If there is a chromosome-level (better) assembly of the same species, you may consider using the related genome instead for phasing and reads partition.

xiekunwhy commented 1 year ago

Ok, I see, so I need to anchor contigs to chromosome first.