rpetit3 / vcf-annotator

Add biological annotations to variants in a given VCF file.
MIT License

KeyError #9

Open KasperH2 opened 3 years ago

KasperH2 commented 3 years ago

Hello, I tried using your program, but I keep getting this error:

Traceback (most recent call last):
  File "/home/pato/miniconda3/envs/vcf_anno/bin/vcf-annotator", line 393, in <module>
    annotator.annotate_vcf_records()
  File "/home/pato/miniconda3/envs/vcf_anno/bin/vcf-annotator", line 67, in annotate_vcf_records
    self.__gb.accession = record.CHROM
  File "/home/pato/miniconda3/envs/vcf_anno/bin/vcf-annotator", line 220, in accession
    self.__gb = self.records[value]
KeyError: 'K02718.1'

I am using a viral reference FASTA and VCF, and the chromosome column of the VCF contains K02718.1. I ran the command vcf-annotator K02718.1.align.vcf K02718.1.gb

I've tried VCF files generated by freebayes and by GATK HaplotypeCaller.

I also tried changing the CHROM column to a single number, which gives a KeyError reporting that number.
The style of the VCF is as follows:

image

Do you have an idea of the problem? Thank you very much for your program and help!

rpetit3 commented 3 years ago

@KasperH2 I apologize for the delay; you caught me during a cross-country move.

Please let me know if you want me to look into this further.

marimaro commented 3 years ago

Hello,

I'm facing the same problem, did you manage to find a solution?

Thanks.

Marina

rpetit3 commented 3 years ago

Hello!

Can I get a VCF and GenBank file to figure this out?

Thank you!

rpetit3 commented 3 years ago

Hi @marimaro or @KasperH2

Just following up to see if you had a VCF and GenBank file you could share.

Thank you!

marimaro commented 3 years ago

Hi Robert,

Sorry for the delay! In the end I managed to run vcf-annotator.

However, when I tried running it with a GenBank file downloaded directly from NCBI (this one), it didn't work, but when I made a gb file from a FASTA and GFF using seqret, it worked.

Also, just for the record, my VCF files had some '*' characters that I had to remove to successfully run vcf-annotator.

Thanks for your time,
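
For anyone hitting the same '*' problem before a fix lands, here is a minimal sketch of one possible workaround. It is not part of vcf-annotator, and the function name drop_star_alleles is made up for illustration. It assumes a tab-delimited VCF and simply skips every record whose ALT column lists a '*' allele (in VCF 4.2 and later, '*' marks an allele that is missing because of an upstream deletion). Whether dropping whole records is acceptable depends on your analysis; the thread does not say exactly how the characters were removed.

```python
def drop_star_alleles(vcf_lines):
    """Yield VCF lines, skipping records whose ALT column contains '*'.

    Hypothetical helper for illustration; not part of vcf-annotator.
    """
    for line in vcf_lines:
        # Header and meta lines pass through unchanged.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; multiple alleles are comma-separated.
        if len(fields) > 4 and "*" in fields[4].split(","):
            continue
        yield line


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "K02718.1\t10\t.\tA\t*\t.\t.\t.\n",
        "K02718.1\t20\t.\tC\tT,*\t.\t.\t.\n",
        "K02718.1\t30\t.\tG\tA\t.\t.\t.\n",
    ]
    for kept in drop_star_alleles(example):
        print(kept, end="")
```

To filter a file, pass an open file handle as vcf_lines and write the yielded lines to a new file.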

rpetit3 commented 3 years ago

Thank you for bringing up the '*' characters. I'll work on getting that replaced.

I'll also play around with the GenBank file.

rpetit3 commented 3 years ago

@marimaro last request: do you by chance have a VCF that I could test? If not, it's OK; I'll make a fake one that should match your issues.

rpetit3 commented 3 years ago

Alright, so I think the issue is that vcf-annotator expects the CHROM field in the VCF to match the ACCESSION in the GenBank file, but in your VCFs the CHROM matches the VERSION.

ACCESSION   K02718
VERSION     K02718.1

So what I'm thinking is I'll add a check something like:

try:
    accession = CHROM[ACCESSION]
except KeyError:
    accession = CHROM[VERSION]

But it would be really useful to have a good example VCF.
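
A runnable sketch of that fallback idea, using plain dictionaries rather than vcf-annotator's real internals (which are not shown in this thread). The helper names read_ids, index_records and find_record are hypothetical. It assumes each GenBank record is available as text with standard ACCESSION and VERSION header lines.

```python
def read_ids(genbank_text):
    """Return (accession, version) read from a GenBank record's header lines."""
    accession = version = None
    for line in genbank_text.splitlines():
        parts = line.split()
        if line.startswith("ACCESSION") and len(parts) > 1:
            accession = parts[1]
        elif line.startswith("VERSION") and len(parts) > 1:
            version = parts[1]
        elif line.startswith(("FEATURES", "ORIGIN")):
            break  # the header is over; no need to scan the rest
    return accession, version


def index_records(genbank_texts):
    """Build two lookup tables: one keyed by ACCESSION, one by VERSION."""
    by_accession, by_version = {}, {}
    for text in genbank_texts:
        accession, version = read_ids(text)
        if accession is not None:
            by_accession[accession] = text
        if version is not None:
            by_version[version] = text
    return by_accession, by_version


def find_record(chrom, by_accession, by_version):
    """Look up a VCF CHROM value by ACCESSION first, then by VERSION."""
    try:
        return by_accession[chrom]
    except KeyError:
        # Raises KeyError again if CHROM matches neither identifier.
        return by_version[chrom]


if __name__ == "__main__":
    header = "ACCESSION   K02718\nVERSION     K02718.1\n"
    by_acc, by_ver = index_records([header])
    # Both spellings of the reference name now resolve to the same record.
    assert find_record("K02718", by_acc, by_ver) is header
    assert find_record("K02718.1", by_acc, by_ver) is header
```

A CHROM value that matches neither identifier (for example a bare number) still raises KeyError, which is probably the desired behaviour since the reference genuinely cannot be found.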

rpetit3 commented 3 years ago

Ok, I think I fixed the KeyError issue.

Example VCF

##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_045512.2     25      .       T       G       .       .       .
NC_045512.2     241     .       C       T       .       .       .
NC_045512.2     512     .       C       T       .       .       .
NC_045512.2     514     .       T       C       .       .       .
NC_045512.2     520     .       G       T       .       .       .

Example Genbank

LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
            complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.

Output annotated VCF

##fileformat=VCFv4.1
##INFO=<ID=RefCodon,Number=.,Type=String,Description="Reference codon">
##INFO=<ID=AltCodon,Number=.,Type=String,Description="Alternate codon">
##INFO=<ID=RefAminoAcid,Number=.,Type=String,Description="Reference amino acid">
##INFO=<ID=AltAminoAcid,Number=.,Type=String,Description="Alternate amino acid">
##INFO=<ID=CodonPosition,Number=1,Type=Integer,Description="Codon position in the gene">
##INFO=<ID=SNPCodonPosition,Number=1,Type=Integer,Description="SNP position in the codon">
##INFO=<ID=AminoAcidChange,Number=.,Type=String,Description="Amino acid change">
##INFO=<ID=IsSynonymous,Number=1,Type=Integer,Description="0:nonsynonymous, 1:synonymous, 9:N/A or Unknown">
##INFO=<ID=IsTransition,Number=1,Type=Integer,Description="0:transversion, 1:transition, 9:N/A or Unknown">
##INFO=<ID=IsGenic,Number=1,Type=Integer,Description="0:intergenic, 1:genic">
##INFO=<ID=IsPseudo,Number=1,Type=Integer,Description="0:not pseudo, 1:pseudo gene">
##INFO=<ID=LocusTag,Number=.,Type=String,Description="Locus tag associated with gene">
##INFO=<ID=Gene,Number=.,Type=String,Description="Name of gene">
##INFO=<ID=Note,Number=.,Type=String,Description="Note associated with gene">
##INFO=<ID=Inference,Number=.,Type=String,Description="Inference of feature.">
##INFO=<ID=Product,Number=.,Type=String,Description="Description of gene">
##INFO=<ID=ProteinID,Number=.,Type=String,Description="Protein ID of gene">
##INFO=<ID=Comments,Number=.,Type=String,Description="Example: Negative strand: T->C">
##INFO=<ID=VariantType,Number=.,Type=String,Description="Indel, SNP, Ambiguous_SNP">
##INFO=<ID=FeatureType,Number=.,Type=String,Description="The feature type of variant.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=29903>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
NC_045512.2     25      .       T       G       .       .       RefCodon=.;AltCodon=.;RefAminoAcid=.;AltAminoAcid=.;CodonPosition=.;SNPCodonPosition=.;AminoAcidChange=.;IsSynonymous=9;IsTransition=0;IsGenic=0;IsPseudo=0;LocusTag=.;Gene=.;Note=.;Inference=.;Product=.;ProteinID=.;Comments=.;VariantType=SNP;FeatureType=inter_genic
NC_045512.2     241     .       C       T       .       .       RefCodon=.;AltCodon=.;RefAminoAcid=.;AltAminoAcid=.;CodonPosition=.;SNPCodonPosition=.;AminoAcidChange=.;IsSynonymous=9;IsTransition=1;IsGenic=0;IsPseudo=0;LocusTag=.;Gene=.;Note=.;Inference=.;Product=.;ProteinID=.;Comments=.;VariantType=SNP;FeatureType=inter_genic
NC_045512.2     512     .       C       T       .       .       RefCodon=CAT;AltCodon=TAT;RefAminoAcid=H;AltAminoAcid=Y;CodonPosition=83;SNPCodonPosition=0;AminoAcidChange=H83Y;IsSynonymous=0;IsTransition=1;IsGenic=1;IsPseudo=0;LocusTag=GU280_gp01;Gene=ORF1ab;Note=pp1a;Inference=.;Product=ORF1a[space]polyprotein;ProteinID=YP_009725295.1;Comments=.;VariantType=SNP;FeatureType=CDS
NC_045512.2     514     .       T       C       .       .       RefCodon=CAT;AltCodon=CAC;RefAminoAcid=H;AltAminoAcid=H;CodonPosition=83;SNPCodonPosition=2;AminoAcidChange=H83H;IsSynonymous=1;IsTransition=1;IsGenic=1;IsPseudo=0;LocusTag=GU280_gp01;Gene=ORF1ab;Note=pp1a;Inference=.;Product=ORF1a[space]polyprotein;ProteinID=YP_009725295.1;Comments=.;VariantType=SNP;FeatureType=CDS
NC_045512.2     520     .       G       T       .       .       RefCodon=ATG;AltCodon=ATT;RefAminoAcid=M;AltAminoAcid=I;CodonPosition=85;SNPCodonPosition=2;AminoAcidChange=M85I;IsSynonymous=0;IsTransition=0;IsGenic=1;IsPseudo=0;LocusTag=GU280_gp01;Gene=ORF1ab;Note=pp1a;Inference=.;Product=ORF1a[space]polyprotein;ProteinID=YP_009725295.1;Comments=.;VariantType=SNP;FeatureType=CDS
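
For reading the annotated output, here is a small sketch that splits an INFO string into a dictionary and reproduces the CodonPosition and SNPCodonPosition values shown above. It is an illustration, not vcf-annotator's own code, and the function names are hypothetical. The arithmetic assumes a forward-strand, single-segment CDS, and it assumes the ORF1ab CDS starts at position 266 of NC_045512.2; that coordinate is not stated in this thread.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def codon_coordinates(pos, cds_start):
    """Return (1-based codon number, 0-based offset within the codon).

    Assumes pos and cds_start are 1-based and the CDS is on the forward
    strand with no joins or frameshifts before pos.
    """
    offset = pos - cds_start
    return offset // 3 + 1, offset % 3


if __name__ == "__main__":
    info = ("RefCodon=CAT;AltCodon=TAT;RefAminoAcid=H;AltAminoAcid=Y;"
            "CodonPosition=83;SNPCodonPosition=0;AminoAcidChange=H83Y")
    fields = parse_info(info)
    print(fields["AminoAcidChange"])

    # Assumed CDS start of 266 for ORF1ab (not given in the thread).
    codon, offset = codon_coordinates(512, 266)
    assert (codon, offset) == (int(fields["CodonPosition"]),
                               int(fields["SNPCodonPosition"]))
```

With the same assumed start, positions 514 and 520 give (83, 2) and (85, 2), matching the other two CDS records in the example.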

rpetit3 commented 3 years ago

I will need an example of VCFs with '*' in them, unless this is a good example: https://github.com/rpetit3/vcf-annotator/issues/6#issuecomment-763676622

marimaro commented 3 years ago

Awesome! Yes, that is a good example of what I had in my VCF files.

BioWilko commented 2 years ago

Here's another example if you're still looking into this, @rpetit3: calls.vcf.txt

rpetit3 commented 2 years ago

Awesome, thank you @BioWilko!

rpetit3 commented 2 years ago

I just pushed v0.7 with a fix for this issue: https://github.com/rpetit3/vcf-annotator/releases/tag/v0.7

Please let me know if that's not the case!

rpetit3 commented 2 years ago

The KeyError issue specifically, not the '*' issue.