sigven / gvanno

Generic human DNA variant annotation pipeline

issue with annotating simple vcf only contains 8 columns of VCFv4.1 #8

Closed ipstone closed 3 years ago

ipstone commented 4 years ago

Hello,

Thank you for your work on gvanno! I have gotten gvanno working on my CentOS 7 box, and the example annotation ran well. However, when I try to annotate a simple VCF file like the one below, I run into the following error (probably on every variant line):

...
            ERROR: Line ...: Format is not a colon-separated list of alphanumeric strings.
            ERROR: Line ....: Format is not a colon-separated list of alphanumeric strings.
            ERROR: Line ....: Format is not a colon-separated list of alphanumeric strings.
            ERROR: Line ....: Format is not a colon-separated list of alphanumeric strings.

The VCF file looks like the following:

##fileformat=VCFv4.1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       2115900 .       T       C       .       PASS    AN=3646;AC=11   GT
1       2115911 .       C       G       .       PASS    AN=3646;AC=2    GT
1       2115912 .       G       A       .       PASS    AN=3646;AC=1    GT
1       2115999 .       C       T       .       PASS    AN=3646;AC=4    GT
1       2116124 .       C       G       .       PASS    AN=3646;AC=10,0 GT
....

What might be the cause of the error? Is there a way to format the vcf to get it properly annotated?

Thanks in advance!

-- ipstone

sigven commented 4 years ago

Hi @ipstone, Thanks for reaching out. If you look more closely at the excerpt of your VCF file above, it contains 9 columns and not 8. A quick suggestion is that you simply remove the 9th (FORMAT) column, since you do not have any sample (genotype) data present (this should occur in column 10 if you have a FORMAT column, and I suspect that is the reason it fails).

Try to see if that might give you success.

best, Sigve
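
The suggestion above can be sketched in a few lines of Python. This is only an illustration, assuming a plain-text, tab-delimited VCF without sample columns; the function name `drop_format_column` is hypothetical and not part of gvanno.

```python
def drop_format_column(lines):
    """Yield VCF lines with the FORMAT column removed when no sample columns follow it."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##") or not line:
            # Meta-information lines (and empty lines) pass through unchanged.
            yield line
            continue
        fields = line.split("\t")
        # Exactly 9 columns means FORMAT is present without any sample data.
        if len(fields) == 9:
            fields = fields[:8]
        yield "\t".join(fields)


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
    "1\t2115900\t.\tT\tC\t.\tPASS\tAN=3646;AC=11\tGT",
]
for out in drop_format_column(example):
    print(out)
```

Lines with 10 or more columns (FORMAT plus at least one sample) are left untouched, so the sketch only affects sites-only files that carry a stray FORMAT column.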

ipstone commented 4 years ago

Thanks @sigven for picking out the format issue, it helps.

After removing the extra column, I noticed some other issues. My input VCF file was originally a phased germline variants file, so some lines carry information for multiple alleles. With the vcf_validate option kept on, it runs into the following errors:

ERROR: Line 158011: INFO SVTYPE must be one of: BND, CNV, DEL, DUP, INS, INV. Found SVTYPE was 'MEI'.
ERROR: Line 163444: INFO SVTYPE must be one of: BND, CNV, DEL, DUP, INS, INV. Found SVTYPE was 'MEI'.
ERROR: Line 164819: INFO SVTYPE must be one of: BND, CNV, DEL, DUP, INS, INV. Found SVTYPE was 'MEI'.
ERROR: Line 166354: INFO SVTYPE must be one of: BND, CNV, DEL, DUP, INS, INV. Found SVTYPE was 'MEI'.

With the --no_vcf_validate option, gvanno ran all the way through, but gave an error when exporting the TSV file (I think):

2020-09-15 20:51:07 - gvanno-gene-annotate - INFO - Completed summary of functional annotations for 9345 variants on chromosome 16
2020-09-15 20:51:17 - gvanno-gene-annotate - INFO - Completed summary of functional annotations for 11197 variants on chromosome 17
2020-09-15 20:51:18 - gvanno-gene-annotate - INFO - Completed summary of functional annotations for 4889 variants on chromosome 18
Traceback (most recent call last):
  File "/gvanno/gvanno_summarise.py", line 125, in <module>
    if __name__=="__main__": __main__()
  File "/gvanno/gvanno_summarise.py", line 23, in __main__
    extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction)
  File "/gvanno/gvanno_summarise.py", line 91, in extend_vcf_annotations
    csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
  File "/gvanno/lib/annoutils.py", line 705, in parse_vep_csq
    assign_cds_exon_intron_annotations(csq_record)
  File "/gvanno/lib/annoutils.py", line 454, in assign_cds_exon_intron_annotations
    exon_pos_info = csq_record['NearestExonJB'].split("+")
AttributeError: 'NoneType' object has no attribute 'split'

Might the issue come from this warning:

gvanno-validate-input - WARNING - Multiallelic site detected:8    48317851        CGTGTGTGT       CGTGTGT,CGTGT,CGT,CGTGTGTGTGT,C,CGTGTGTGTGTGT

I have split the multiallelic records in the VCF into one variant per line, and will test again to see if that helps.

Thanks again, I think it's making progress. The '...gvanno_ready.vep.vcfanno.annotated.vcf' output file already exists, so I guess I could manually parse out the annotation if I still have trouble with the last step in Python.
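
For illustration, splitting multiallelic records into one variant per line could look roughly like the sketch below. It assumes a tab-delimited, sites-only VCF line; the function name `split_multiallelic` and the example record (loosely based on the warning above, with placeholder ID, QUAL, FILTER and INFO values) are hypothetical. INFO is copied unchanged, so per-allele fields such as AC are not split, and alleles are neither trimmed nor left-aligned. Dedicated tools such as `bcftools norm` handle those cases more completely.

```python
def split_multiallelic(line):
    """Return one record per ALT allele for a tab-delimited, sites-only VCF line."""
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # ALT is the fifth VCF column
    records = []
    for alt in alts:
        record = list(fields)
        record[4] = alt  # keep every other column, including INFO, as-is
        records.append("\t".join(record))
    return records


# Illustrative record; only CHROM, POS, REF and part of ALT come from the warning.
site = "8\t48317851\t.\tCGTGTGTGT\tCGTGTGT,CGTGT,CGT\t.\tPASS\t."
for rec in split_multiallelic(site):
    print(rec)
```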

ipstone commented 4 years ago

Using the formatted input VCF with one variant per line, I still run into a similar error at line 125:

2020-09-15 21:51:31 - gvanno-gene-annotate - INFO - Completed summary of functional annotations for 11197 variants on chromosome 17
2020-09-15 21:51:33 - gvanno-gene-annotate - INFO - Completed summary of functional annotations for 4889 variants on chromosome 18
Traceback (most recent call last):
  File "/gvanno/gvanno_summarise.py", line 125, in <module>
    if __name__=="__main__": __main__()
  File "/gvanno/gvanno_summarise.py", line 23, in __main__
    extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction)
  File "/gvanno/gvanno_summarise.py", line 91, in extend_vcf_annotations
    csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
  File "/gvanno/lib/annoutils.py", line 705, in parse_vep_csq
    assign_cds_exon_intron_annotations(csq_record)
  File "/gvanno/lib/annoutils.py", line 454, in assign_cds_exon_intron_annotations
    exon_pos_info = csq_record['NearestExonJB'].split("+")
AttributeError: 'NoneType' object has no attribute 'split'

The traceback points to annoutils.py line 454: https://github.com/sigven/gvanno/blob/7c6affd7ddc1badeb1e20a415af5729faacbfd0a/src/gvanno/lib/annoutils.py#L454. Might it be a NoneType error on csq_record['NearestExonJB']? Perhaps a None check could be added here?
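
A rough sketch of such a None check is shown below. It is not gvanno's actual fix: the helper name `nearest_exon_parts` is hypothetical, and it assumes `csq_record` behaves like a dict, as the subscript access in the traceback suggests.

```python
def nearest_exon_parts(csq_record):
    """Split the 'NearestExonJB' value on '+', or return None when it is missing."""
    value = csq_record.get("NearestExonJB")
    if value is None:
        # No exon-junction annotation for this consequence: skip instead of crashing.
        return None
    return value.split("+")


print(nearest_exon_parts({"NearestExonJB": None}))
print(nearest_exon_parts({"NearestExonJB": "a+b+c"}))
```

The caller would then need to skip the exon/intron assignment whenever the helper returns None.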

sigven commented 4 years ago

Looks like you have hit the target there. If you are able to share your VCF with me (or some parts of it) along with your assembly and configuration file, I can make a more robust check when looking for exon junction information (i.e. check for None), and test that it works for your case.

thanks, Sigve

sigven commented 3 years ago

Fixed