vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0
31 stars 4 forks source link

Cannot add DP to variant fields #103

Open brankovicmarija opened 5 years ago

brankovicmarija commented 5 years ago

Hello! I am using version 3.0.2 trying to add DP to the available fields according to your tutorial with the following command:

vtools import ././SCN198.selected.vcf --build hg19 --var_info DP,qual,filter,id --geno_info DP_geno,AD_geno,qual --format vcf

First I get an unusual warning: WARNING: hg19: Failed to get reference sequence: unrecognized chromosome id: 65535 There is no such sequence in my file: $grep 65535 SCN198.selected.vcf chr12 82265535 . A G 35.44 PASS AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=32.33;SOR=1.609GT:AD:DP:GQ:PL 1/1:0,1:1:3:45,3,0

Then, I get an error:

Importing genotypes: 100% [=================================================================================================] 531,172 37.9K/s in 00:00:14
Copying samples: 100% [=============================================================================================================] 1 0.2/s in 00:00:04
Current storage mode is HDF5, transfrom genotype storage mode.....
Creating indexes: 100% [============================================================================================================] 1 0.8/s in 00:00:01
Selecting genotypes: 100% [=========================================================================================================] 2 2.0/s in 00:00:01
Getting existing variants: 100% [==========================================================================================] 265,963 640.5K/s in 00:00:00
list index out of range
Process GenotypeImporter:
Traceback (most recent call last):
  File "/home/korisnik/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/korisnik/anaconda3/lib/python3.7/site-packages/variant_tools/importer_allele_hdf5.py", line 1135, in run
    self.get_geno(variant_id,i,altIndex)
  File "/home/korisnik/anaconda3/lib/python3.7/site-packages/variant_tools/importer_allele_hdf5.py", line 1078, in get_geno
    if "variants" in self.namedict[info]:
KeyError: 'DP'

Even though the process fails, the sample is still added to the list:

vtools show samples
test20      SCN198.selected.vcf
test20      ./eff.scn198.vcf
test20      ./SCN198.selected.vcf
test20      eff.scn198.vcf

When I try to remove the samples to try importing again, it fails:

$ vtools remove samples '*'
ERROR: Failed to retrieve samples by condition "(*)": near "*": syntax error
$ vtools remove samples 'filename = SCN198.selected.vcf'
ERROR: Failed to retrieve samples by condition "(filename = SCN198.selected.vcf)": no such column: SCN198.selected.vcf

Is there a systematic error here I am not seeing? How can I remove the sample and add DP to my fields?

jma7 commented 5 years ago

Hi brankovicmarija, Would you please remove the --format vcf, just try

vtools import ././SCN198.selected.vcf --build hg19 --var_info DP,qual,filter,id --geno_info DP_geno,AD_geno,qual

Since your file is in vcf format which is the default format.

brankovicmarija commented 5 years ago

Hi! I tried that but it returns this message: [:10])>0: TypeError: '>' not supported between instances of 'str' and 'int' Also tried this command grep ">0" eff.scn198.vcf and nothing found

jma7 commented 5 years ago

Have you tried reinitiate the project from the beginning?

vtools init myproject -f

Then, import vcf again.

BoPeng commented 5 years ago

@brankovicmarija Could you send us a minimal example to demonstrate the problem, such as the first 1000 lines of your vcf file and exact commands that you have used?

brankovicmarija commented 5 years ago

@jma7 I tried but it is the same error :[:10])>0: TypeError: '>' not supported between instances of 'str' and 'int'

@BoPeng Here is the command: vtools import ./SCN198.selected.vcf --build hg19 --var_info DP,qual,filter,id --geno_info DP_geno,AD_geno,qual And here is the vcf: ##fileformat=VCFv4.2

FILTER=

FILTER= 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0">

FILTER= 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0">

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller Version=4.1.0.0,Date="March 1, 2019 3:18:49 PM CET">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

reference=file:///home/korisnik/Downloads/wg.fa

source=HaplotypeCaller

source=SelectVariants

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test20

chr10 64863 . A C 37.28 my_snp_filter AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 64863 . A C 37.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 68091 . T C 37.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 68091 . T C 37.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 74448 . G T 37.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=47.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 74448 . G T 37.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=47.00;QD=18.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:49,6,0 chr10 93603 . C T 143.10 my_snp_filter AC=2;AF=1.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.18;QD=25.36;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,4:4:12:157,12,0 chr10 93603 . C T 143.10 PASS AC=2;AF=1.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.18;QD=25.36;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,4:4:12:157,12,0 chr10 93816 . C T 362.60 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.030;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.85;MQRankSum=-0.686;QD=17.27;ReadPosRankSum=-0.943;SOR=0.458 GT:AD:DP:GQ:PL 0/1:8,13:21:99:370,0,242 chr10 93816 . C T 362.60 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.030;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.85;MQRankSum=-0.686;QD=17.27;ReadPosRankSum=-0.943;SOR=0.458 GT:AD:DP:GQ:PL 0/1:8,13:21:99:370,0,242 chr10 93945 . G A 278.98 PASS AC=2;AF=1.00;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=54.81;QD=28.73;SOR=4.174 GT:AD:DP:GQ:PL 1/1:0,7:7:21:293,21,0 chr10 93945 . G A 278.98 PASS AC=2;AF=1.00;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=54.81;QD=28.73;SOR=4.174 GT:AD:DP:GQ:PL 1/1:0,7:7:21:293,21,0 chr10 94083 . C T 33.44 PASS AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=40.00;QD=33.44;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,1:1:3:43,3,0 chr10 94083 . C T 33.44 PASS AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=40.00;QD=33.44;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,1:1:3:43,3,0 chr10 95226 . A T 67.28 PASS AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=33.64;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:79,6,0

jma7 commented 5 years ago

Hi, The --var_info option in the vtools import command line was meant to import fields in the INFO column from the vcf file, currently DP,GQ,GD,HQ,AD,PL,MQ,NS tags are supported. So you might want to adjust your command line to include the tags you are interested in, for example

vtools import SCN198.selected.vcf --build hg19 --var_info DP AC AF --geno_info DP_geno,AD_geno

We will improve our error reporting to make this more clear, Thanks!

brankovicmarija commented 5 years ago

Thanks for the reply.

We reset the project and tried your command. There were many non-breaking key errors during the execution e.g.:

Traceback (most recent call last):
  File "/home/korisnik/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/korisnik/anaconda3/lib/python3.7/site-packages/variant_tools/importer_allele_hdf5.py", line 1132, in run
    variant_id  = self.variantIndex[tuple((chr, ref, alt))][pos][0]
KeyError: 110874502

Looking at the VCF file there were always locations one position smaller. I also remind you of my first post and the problem with warning "unrecognized chromosome id: 65535".

At the end, we show fields:

variant.chr (char)      Chromosome name (VARCHAR)
variant.pos (int)       Position (INT, 1-based)
variant.ref (char)      Reference allele (VARCHAR, - for missing allele of an
                        insertion)
variant.alt (char)      Alternative allele (VARCHAR, - for missing allele of an
                        deletion)
variant.DP (int)
variant.AC (int)
variant.AF (float)
variant.num (int)       Created from stat "#(alt)"  with type INT on Apr19
variant.hom (int)       Created from stat "#(hom)"  with type INT on Apr19
variant.het (int)       Created from stat "#(het)"  with type INT on Apr19
variant.other (int)     Created from stat "#(other)"  with type INT on Apr19
variant.total (int)     Created from stat "#(GT)"  with type INT on Apr19

Any clue why DP_geno was not generated?

Thanks!

jma7 commented 5 years ago

vtools show fields will only show the fields in the variant table. DP_geno is for each sample which is saved in *.h5 file. I am not sure what is causing the one position smaller problem you mentioned. It seems that there are duplication of variants in your VCF file, but the sample file was imported fine at my end. Would you please send me the portion of the VCF file having the key error problem, maybe plus minus 100 lines around position 110874502?