jakub-auger closed this issue 1 year ago
@jakub-auger To register glow, which is necessary to read block gzipped VCFs, write it like this:
import glow
spark = glow.register(spark)
instead of just calling glow.register(spark) without capturing the returned session.
I think this is the issue; please try again and confirm you can read the VCF.
Here is the relevant docs page: https://glow.readthedocs.io/en/latest/getting-started.html
Thanks, don't know how I missed it!
It often takes another pair of eyes to catch subtle things like this!
I have an intermittent issue on Databricks: I'm unable to load bgz files 90% of the time. I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 18.0 failed 4 times, most recent failure: Lost task 3.3 in stage 18.0 (TID 72) (10.139.64.4 executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
System config: tried both the 1.2.1 Docker image on a 10.4 LTS instance and a manually configured 10.4 LTS cluster where I installed the Maven package by hand.
Steps to reproduce:
Expected output:
a valid df
Actual output:
error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 18.0 failed 4 times, most recent failure: Lost task 3.3 in stage 18.0 (TID 72) (10.139.64.4 executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
Further info: this works fine with uncompressed VCFs and with .gz (e.g. input_vcf_path = 'wasb://gnomad@azureopendatastorage.blob.core.windows.net/release/2.0.1/vcf/genomes/gnomad.genomes.r2.0.1.sites.chr15.vcf.gz')
So it looks like it's the bgzip codec that's not working? Have I missed a step?
I tried rdd.take(10) on the vcf.bgz file and got gibberish, whereas rdd.take(10) on the vcf.gz file gives the actual text.
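If it helps to narrow this down: BGZF (the .bgz format) is ordinary gzip whose header carries a 'BC' extra subfield, so you can sanity-check the file outside Spark with only the standard library. A minimal sketch (the helper names here are mine, not part of Glow or htsjdk):

```python
# Pure-stdlib checks for a .bgz file -- a diagnostic sketch, not Glow code.
# BGZF = gzip with FEXTRA set and a 'BC' extra subfield carrying the block size.
import gzip
import struct
import zlib

def looks_like_bgzf(header: bytes) -> bool:
    """Return True if `header` starts a BGZF (block-gzip) stream."""
    # gzip magic, deflate method, FEXTRA flag set
    if len(header) < 18 or header[:4] != b"\x1f\x8b\x08\x04":
        return False
    xlen = struct.unpack_from("<H", header, 10)[0]
    extra = header[12:12 + xlen]
    i = 0
    while i + 4 <= len(extra):
        si1, si2 = extra[i], extra[i + 1]
        slen = struct.unpack_from("<H", extra, i + 2)[0]
        if si1 == 0x42 and si2 == 0x43 and slen == 2:  # 'B','C' subfield
            return True
        i += 4 + slen
    return False

def bgzf_block(data: bytes) -> bytes:
    """Wrap `data` in a single BGZF block (a valid gzip member by construction)."""
    deflated = zlib.compress(data, 6)[2:-4]           # strip zlib framing -> raw deflate
    bsize = 18 + len(deflated) + 8                    # total block length
    header = (b"\x1f\x8b\x08\x04" + b"\x00" * 6       # gzip magic + FEXTRA, zeroed fields
              + struct.pack("<H", 6)                  # XLEN
              + b"BC" + struct.pack("<HH", 2, bsize - 1))
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + deflated + trailer

block = bgzf_block(b"##fileformat=VCFv4.2\n")
print(looks_like_bgzf(block))        # True for a genuine BGZF header
print(gzip.decompress(block))        # stdlib gzip decompresses BGZF fine
```

Reading the first ~64 bytes of your .bgz file and passing them to looks_like_bgzf tells you whether it was really block-gzipped (a plain-gzipped file renamed to .bgz would fail). And since stdlib gzip decompresses BGZF without complaint, gibberish from rdd.take suggests the bytes are being read raw, i.e. the .bgz extension isn't mapped to any codec until Glow registers one, which is consistent with the registration issue above.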