ifishlin / HAHap

A read-based haplotyping tool using hierarchical assembly

UnicodeDecodeError #2

Open dhwani2410 opened 5 years ago

dhwani2410 commented 5 years ago

=== Start HAHap phasing ===
Parameters: Minimum mapping quality = 0
Parameters: Threshold of low coverage = Median
Parameters: Minimum junction number = 4
Parameters: Likelihood of P1 and P2 = 0.49

=== Read Heterozygous Data ===
Traceback (most recent call last):
  File "./bin/HAHap", line 9, in <module>
    main()
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/main.py", line 73, in main
    module.main(args)
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/phase.py", line 56, in main
    var_chrom_dict = split_vcf_by_chrom(args.variant_file)
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/vcf.py", line 42, in split_vcf_by_chrom
    for line in variants_vcf:
  File "/home/dhwani/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

ifishlin commented 5 years ago

Please check your VCF encoding; it should be UTF-8.
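One way to run that encoding check is sketched below. It is not part of HAHap: the helper name `first_invalid_utf8_offset` is hypothetical, and the sketch assumes the file is small enough to read into memory in one go.

```python
from pathlib import Path


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration; it reads the whole file into
    memory, which is fine for small VCFs but not for very large ones.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None
```

A return value of `None` means the file decodes cleanly as UTF-8. A result of `1` on a file whose second byte is `0x8b` is worth a second look, because gzip streams always start with the two bytes `0x1f 0x8b`.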

dhwani2410 commented 5 years ago

When I use the example data, I get the same error:

./bin/HAHap phase data/HG002.hs37d5.2x250.bam HG002_heter.vcf.gz out_sample.txt

=== Start HAHap phasing ===
Parameters: Minimum mapping quality = 0
Parameters: Threshold of low coverage = Median
Parameters: Minimum junction number = 4
Parameters: Likelihood of P1 and P2 = 0.49

=== Read Heterozygous Data ===
Traceback (most recent call last):
  File "./bin/HAHap", line 9, in <module>
    main()
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/main.py", line 73, in main
    module.main(args)
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/phase.py", line 56, in main
    var_chrom_dict = split_vcf_by_chrom(args.variant_file)
  File "/home/dhwani/Documents/softwares/HAHap/HAHap/vcf.py", line 44, in split_vcf_by_chrom
    for line in variants_vcf:
  File "/home/dhwani/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

ifishlin commented 5 years ago

Hi, please unzip the vcf.gz first; the input must be a plain-text VCF file. I uploaded the gzipped file only to save storage space.
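The byte `0x8b` at position 1 in the traceback matches the gzip magic number (`0x1f 0x8b`), which fits a compressed file being read as text. A minimal sketch of a pre-processing step that detects this and writes an uncompressed copy is shown below. The helper names `is_gzipped` and `ensure_plain_vcf` are hypothetical and not part of HAHap.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def ensure_plain_vcf(path):
    """Return a path to an uncompressed copy of the VCF (hypothetical helper).

    Plain-text inputs are returned unchanged. Gzipped inputs are
    decompressed next to the original: "x.vcf.gz" becomes "x.vcf",
    any other name gets ".plain" appended.
    """
    path = Path(path)
    if not is_gzipped(path):
        return path
    if path.suffix == ".gz":
        target = path.with_suffix("")
    else:
        target = path.with_name(path.name + ".plain")
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

With this sketch, `ensure_plain_vcf("HG002_heter.vcf.gz")` would write `HG002_heter.vcf` next to the original and return its path. That path can then be passed to `./bin/HAHap phase` in place of the `.gz` file.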

dhwani2410 commented 5 years ago

@ifishlin It worked with the sample file after I unzipped the VCF. I also checked that my own VCF file is UTF-8 encoded. I have uploaded it as a tab-delimited .txt file because the .vcf extension is not supported here.

vcf_tab_delimited.txt

Can you please have a look at this file and let me know what could have been a possible source of error?

ifishlin commented 5 years ago

Remove the extra " characters in the file, for example:
(1) "1/1:7,911:918:99:26237,2450,0" => 1/1:7,911:918:99:26237,2450,0
(2) "##FILTER=<ID=LowQual,Description=""Low quality"">" => ##FILTER=<ID=LowQual,Description="Low quality">
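Both examples look like CSV-style quoting, where a field is wrapped in double quotes and any embedded quote is doubled. Spreadsheet tools commonly add this when exporting tab-delimited text. A minimal sketch for undoing it with Python's standard `csv` module follows. The function name `unquote_tsv_lines` is hypothetical, and the sketch assumes the columns are tab-separated.

```python
import csv


def unquote_tsv_lines(lines):
    """Yield tab-delimited lines with CSV-style quoting removed.

    Hypothetical helper: csv.reader strips quotes that wrap a whole
    field and turns doubled quotes ("") inside such a field back into
    a single quote. Quotes inside unwrapped fields are left untouched.
    """
    for row in csv.reader(lines, delimiter="\t"):
        yield "\t".join(row)


def clean_file(src_path, dst_path):
    """Write an unquoted copy of src_path to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in unquote_tsv_lines(src):
            dst.write(line + "\n")
```

Because only fields that start with a quote are treated as quoted, a header line that is already correct, such as `##FILTER=<ID=LowQual,Description="Low quality">`, should pass through unchanged.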

dhwani2410 commented 5 years ago

The quotation marks in the file probably appeared when the VCF was converted to a txt file. Here are the first few lines of the original VCF file for exact details:

##fileformat=VCFv4.2
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --standard-min-confidence-threshold-for-calling 40.0 --dbsnp /home/dhwani.dholakia/archive/files_required_for_exome_analysis/dbsnp/GRCH37.p17_refseq.vcf --output haplotype_caller/DRR015476_haplotyper.vcf --intervals /home/dhwani.dholakia/archive/files_required_for_exome_analysis/hla_one_line.bed --input base_recalib/DRR015476_aligned_sorted_dupmarked_realigned_recalibrated.bam --reference /home/dhwani.dholakia/archive/files_required_for_exome_analysis/reference/Homo_sapiens.GRCh37.dna.chromosome.6.fa --use-new-qual-calculator true --use-old-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --use-filtered-reads-for-annotations false --correct-overlapping-quality false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --consensus false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --max-unpruned-variants 100 --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 0 --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version="4.1.2.0",Date="May 3, 2019 7:40:26 PM IST">
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##contig=
##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DRR015476
6 31321429 rs2596499 T A 26223.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.442;DB;DP=918;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.99;MQRankSum=-0.075;QD=28.57;ReadPosRankSum=-0.221;SOR=1.958 GT:AD:DP:GQ:PL 1/1:7,911:918:99:26237,2450,0
6 31321524 rs2844584 G A 35485.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=10.780;DB;DP=2609;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.89;MQRankSum=0.093;QD=13.69;ReadPosRankSum=1.542;SOR=0.704 GT:AD:DP:GQ:PL 0/1:1122,1471:2593:99:35493,0,22929
6 31321578 rs7762909 A G 103935.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-3.195;DB;DP=3989;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=26.16;ReadPosRankSum=-0.267;SOR=0.958 GT:AD:DP:GQ:PL 1/1:10,3963:3973:99:103949,11666,0
6 31321807 rs2770 G A 276230.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.079;DB;DP=6786;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.98;MQRankSum=-5.602;QD=31.11;ReadPosRankSum=0.679;SOR=0.637 GT:AD:DP:GQ:PL 1/1:3,6775:6778:99:276244,20337,0
6 31321856 rs2768 A G 175572.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.992;DB;DP=6327;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.96;MQRankSum=1.265;QD=27.80;ReadPosRankSum=3.991;SOR=0.986 GT:AD:DP:GQ:PL 1/1:13,6302:6315:99:175586,18623,0
6 31321882 rs2769 G A 86240.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=19.850;DB;DP=6343;ExcessHet=3.0103;FS=0.530;MLEAC=1;MLEAF=0.500;MQ=59.88;MQRankSum=-5.425;QD=13.63;ReadPosRankSum=4.167;SOR=0.644 GT:AD:DP:GQ:PL 0/1:2811,3517:6328:99:86248,0,56712
6 31321906 rs1093 A G 71375.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-22.761;DB;DP=6061;ExcessHet=3.0103;FS=1.767;MLEAC=1;MLEAF=0.500;MQ=59.76;MQRankSum=-1.855;QD=11.79;ReadPosRankSum=2.937;SOR=0.836 GT:AD:DP:GQ:PL 0/1:2702,3350:6052:99:71383,0,87556
6 31321915 rs1055890 A G 208398.05 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.680;DB;DP=5817;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.74;MQRankSum=0.514;QD=30.50;ReadPosRankSum=2.560;SOR=0.439 GT:AD:DP:GQ:PL 1/1:9,5808:5817:99:263663,17466,0
6 31321916 rs1055849 A G 205947.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=-1.288;DB;DP=5824;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.72;MQRankSum=1.344;QD=27.63;ReadPosRankSum=2.409;SOR=0.499 GT:AD:DP:GQ:PL 1/1:11,5807:5818:99:205961,17142,0
6 31321925 rs140769830 T TG 50810.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.994;DB;DP=5672;ExcessHet=3.0103;FS=0.528;MLEAC=1;MLEAF=0.500;MQ=59.68;MQRankSum=0.658;QD=9.02;ReadPosRankSum=-0.256;SOR=0.617 GT:AD:DP:GQ:PL 0/1:3170,2460:5630:99:50818,0,78559
6 31322121 rs2428496 C T 227959.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.090;DB;DP=5464;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.534;QD=32.30;ReadPosRankSum=-0.656;SOR=1.833 GT:AD:DP:GQ:PL 1/1:2,5456:5458:99:227973,16424,0
6 31322129 rs17192932 C G 49114.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.907;DB;DP=5363;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.23;ReadPosRankSum=-2.843;SOR=0.671 GT:AD:DP:GQ:PL 0/1:2919,2402:5321:99:49122,0,64970
6 31322175 rs2428495 C T 206930.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.195;DB;DP=4582;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.99;MQRankSum=-5.558;QD=26.57;ReadPosRankSum=1.369;SOR=2.667 GT:AD:DP:GQ:PL 1/1:2,4579:4581:99:206944,13925,0
6 31322197 rs2428494 T A 179442.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.948;DB;DP=4191;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.99;MQRankSum=0.797;QD=27.12;ReadPosRankSum=0.940;SOR=1.228 GT:AD:DP:GQ:PL 1/1:2,4183:4185:99:179456,12551,0
6 31322220 . C T 163738.06 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.640;DP=3744;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.99;MQRankSum=-0.044;QD=29.14;ReadPosRankSum=2.033;SOR=2.737 GT:AD:DP:GQ:PL 1/1:3,3649:3652:99:163752,11718,0
6 31322367 rs3819299 T G 32668.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-18.180;DB;DP=4089;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.98;MQRankSum=1.866;QD=8.01;ReadPosRankSum=0.049;SOR=0.670 GT:AD:DP:GQ:PL 0/1:2370,1708:4078:99:32676,0,57498
6 31322395 rs17199328 A G 51664.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-16.507;DB;DP=3895;ExcessHet=3.0103;FS=1.834;MLEAC=1;MLEAF=0.500;MQ=59.86;MQRankSum=-2.171;QD=13.31;ReadPosRankSum=-0.806;SOR=0.868 GT:AD:DP:GQ:PL 0/1:1583,2298:3881:99:51672,0,34154