projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

VEP exon/intron annotations failure #479

Closed project-defiant closed 2 years ago

project-defiant commented 2 years ago

Dear maintainers,

I am running Glow in Spark standalone mode to save VEP-annotated VCF files to Parquet. I am using Jupyter notebooks for this process.

I have found that some of the variants parsed by

spark.read.format("vcf").load("path/to/vcf")

do not preserve the INFO_CSQ field from the annotations. Here is the output of the show command for some of the variants:

-RECORD 0-------------------------------------
 contigName            | 16                   
 start                 | 46387349             
 end                   | 46434780             
 names                 | null                 
 referenceAllele       | T                    
 alternateAlleles      | [<CNV>]              
 qual                  | null                 
 filters               | null                 
 splitFromMultiAllelic | false                
 INFO_END              | null                 
 INFO_CSQ              | [{copy_number_var... 
 INFO_SVTYPE           | CNV                  
 genotypes             | [{NORMAL, 1, 2, f... 
-RECORD 1-------------------------------------
 contigName            | 16                   
 start                 | 78371757             
 end                   | 78384528             
 names                 | null                 
 referenceAllele       | C                    
 alternateAlleles      | [<CNV>]              
 qual                  | null                 
 filters               | null                 
 splitFromMultiAllelic | false                
 INFO_END              | null                 
 INFO_CSQ              | [{copy_number_var... 
 INFO_SVTYPE           | CNV                  
 genotypes             | [{NORMAL, 1, 2, f... 
-RECORD 2-------------------------------------
 contigName            | 19                   
 start                 | 6229693              
 end                   | 6256940              
 names                 | null                 
 referenceAllele       | C                    
 alternateAlleles      | [<CNV>]              
 qual                  | null                 
 filters               | null                 
 splitFromMultiAllelic | false                
 INFO_END              | null                 
 INFO_CSQ              | null                 
 INFO_SVTYPE           | CNV                  
 genotypes             | [{NORMAL, 1, 2, f... 
-RECORD 3-------------------------------------
 contigName            | 16                   
 start                 | 20448328             
 end                   | 20474682             
 names                 | null                 
 referenceAllele       | C                    
 alternateAlleles      | [<CNV>]              
 qual                  | null                 
 filters               | null                 
 splitFromMultiAllelic | false                
 INFO_END              | null                 
 INFO_CSQ              | null                 
 INFO_SVTYPE           | CNV                  
 genotypes             | [{NORMAL, 1, 2, f... 

It was created by parsing the ascat_vep.vcf file.

After some digging, I found that both the EXON and INTRON fields do not fit the schema

StructType(Seq(StructField("rank", StringType), StructField("total", StringType)))),

because they have the format X-Y/Z rather than X/Z. I have changed these fields in my file and saved it as ascat_annot.vcf. The VCF files are here

Could you check whether you can reproduce it? Is this supposed to happen?
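To illustrate the mismatch described above, here is a minimal sketch in plain Python (not Glow's actual parser) of how an EXON/INTRON value splits into rank and total. The names split_rank_total and is_plain_integer are hypothetical helpers introduced for this example only.

```python
from typing import NamedTuple, Optional


class RankTotal(NamedTuple):
    rank: Optional[str]
    total: Optional[str]


def split_rank_total(value: str) -> RankTotal:
    """Split a value such as "3/12" or "6-8/12" into (rank, total) strings."""
    if not value:
        return RankTotal(None, None)
    # Everything before the first "/" is the rank, everything after is the total.
    rank, _, total = value.partition("/")
    return RankTotal(rank, total or None)


def is_plain_integer(text: Optional[str]) -> bool:
    """True only for values an integer-typed field could hold, e.g. "6"."""
    return text is not None and text.isdigit()


if __name__ == "__main__":
    for raw in ("3/12", "6-8/12"):
        parsed = split_rank_total(raw)
        print(raw, parsed, is_plain_integer(parsed.rank))
```

A rank such as "6-8" survives as a string but fails the integer check, which is consistent with the null annotations reported here.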

williambrandler commented 2 years ago

Hey @project-defiant, what version of Glow are you using?

I think this issue was fixed in a recent pull request. Please take a look at this line:

https://github.com/projectglow/glow/pull/402/files#diff-e4a2d1e623585faea78c460fde0d28f131ebabd318d6a495b7e987302a8f702dR260

project-defiant commented 2 years ago

Hey, I have tried glow.py 1.1.1 and glow.py 1.1.2, but the issue seems to persist in both.

williambrandler commented 2 years ago

We saw the same issue before, where the rank or total is represented as a range ("6-8") for indels instead of an integer (6) as it is for SNPs. Converting the schema from IntegerType to StringType resolved it.

We figured it out by deleting those INFO fields from an annotated VCF; after that, the rows could be read without getting null for the annotation.
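As an alternative to deleting fields by hand, a rough sketch like the following could list the records whose EXON or INTRON entry is a range. It assumes the VCF header carries the usual VEP "Format: ..." list in the CSQ INFO description; csq_field_names and find_range_ranks are hypothetical helper names, not part of Glow.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Matches a range-style value such as "6-8/12".
RANGE_RANK = re.compile(r"^\d+-\d+/\d+$")


def csq_field_names(header_lines: Iterable[str]) -> List[str]:
    """Read the pipe-separated CSQ subfield names from the VCF header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).strip().split("|")
    return []


def find_range_ranks(lines: Iterable[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (contig, position, subfield, value) for range-style EXON/INTRON values."""
    lines = list(lines)
    names = csq_field_names(lines)
    wanted = [i for i, name in enumerate(names) if name in ("EXON", "INTRON")]
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue
        for item in cols[7].split(";"):
            if not item.startswith("CSQ="):
                continue
            # One CSQ entry per transcript, separated by commas.
            for entry in item[4:].split(","):
                parts = entry.split("|")
                for i in wanted:
                    if i < len(parts) and RANGE_RANK.match(parts[i]):
                        yield cols[0], cols[1], names[i], parts[i]
```

It could be called as find_range_ranks(open("path/to/vcf")) for an uncompressed file; the path is a placeholder.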

CNVs should be the same as indels...unless you have exposed another edge case in the schema. But the way you describe the problem it seems the same as what we have seen before.

Please confirm the version of the maven jars you are using for glow...
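Since the Python package version and the jar version can differ, one way to check the jar side is to read the Maven coordinates Spark was started with. This is only a sketch: glow_version is a hypothetical helper, and the io.projectglow group ID is an assumption about how the Glow jar is published.

```python
from typing import Optional


def glow_version(packages: str, group_id: str = "io.projectglow") -> Optional[str]:
    """Extract the version from a comma-separated list of Maven coordinates.

    Coordinates are expected as group:artifact:version; the default
    group_id is an assumption, not confirmed by this thread.
    """
    for coordinate in packages.split(","):
        parts = coordinate.strip().split(":")
        if len(parts) == 3 and parts[0] == group_id:
            return parts[2]
    return None
```

In a notebook, the input string might come from spark.sparkContext.getConf().get("spark.jars.packages", ""). This only helps if the jar was pulled in through that setting; a jar supplied by file path would need its file name inspected instead.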

project-defiant commented 2 years ago

Thank you for the response. It turned out I had an outdated jar file.

williambrandler commented 2 years ago

Ah, OK, great. It took us a couple of days to figure out that issue initially, but it was never clearly documented on GitHub. Apologies for that.