projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

spark.read.format("vcf").load() fails for vcf.bgz files #523

Closed jakub-auger closed 1 year ago

jakub-auger commented 1 year ago

I have an intermittent issue on Databricks: I'm unable to load bgz files about 90% of the time. I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 18.0 failed 4 times, most recent failure: Lost task 3.3 in stage 18.0 (TID 72) (10.139.64.4 executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

System config: I tried both the 1.2.1 Docker image on a 10.4 LTS instance and a manually configured 10.4 LTS cluster where I installed the Maven package by hand.

steps to reproduce

pip install glow.py

import glow
glow.register(spark)

import os
from pyspark.sql.functions import *

input_vcf_path = 'wasb://gnomad@azureopendatastorage.blob.core.windows.net/release/3.1/vcf/genomes/gnomad.genomes.v3.1.sites.chr21.vcf.bgz'

vcf_df = spark.read.format("vcf").load(input_vcf_path)

expected output

a valid DataFrame

actual output

error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 18.0 failed 4 times, most recent failure: Lost task 3.3 in stage 18.0 (TID 72) (10.139.64.4 executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

Further info: this works fine with uncompressed VCFs and .gz files (e.g. input_vcf_path = 'wasb://gnomad@azureopendatastorage.blob.core.windows.net/release/2.0.1/vcf/genomes/gnomad.genomes.r2.0.1.sites.chr15.vcf.gz')

So it looks like it's the bgzip codec that's not working? Have I missed a step?

I tried rdd.take(10) on the vcf.bgz file and I get gibberish, whereas rdd.take(10) on the vcf.gz file returns the actual text.
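The gibberish makes sense if the codec isn't registered: BGZF is a gzip-compatible block format, and Spark only decompresses files whose extension maps to a registered codec, so an unmapped .bgz file is read as raw compressed bytes. A minimal sketch of the same effect using only Python's stdlib gzip module as a stand-in for a .bgz file (the VCF header bytes here are illustrative, not real gnomAD data):

```python
# Sketch: why raw reads of a block-gzipped file look like gibberish.
# BGZF blocks are valid gzip members, so a gzip-aware reader recovers
# the text while a plain byte reader sees compressed data.
import gzip

payload = b"##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n"  # toy VCF header
blob = gzip.compress(payload)      # stands in for the .bgz file's bytes

raw_view = blob[:20]               # what a reader without the codec sees
decoded = gzip.decompress(blob)    # what a gzip/BGZF-aware reader sees

assert raw_view != payload[:20]    # compressed bytes: "gibberish"
assert decoded == payload          # the header text is recovered
```

This mirrors the symptom above: rdd.take(10) on the .bgz file returns compressed bytes until a BGZF codec is wired in.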

williambrandler commented 1 year ago

@jakub-auger To register Glow, which is necessary to read block-gzipped VCFs, write it like this:

import glow
spark = glow.register(spark)

instead of just glow.register(spark)

I think this is the issue; please try again and confirm you can read the VCF.

Here is the relevant docs page: https://glow.readthedocs.io/en/latest/getting-started.html
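Putting the fix into the original repro, the corrected flow would look like this (a sketch only; it needs a live Spark cluster with the Glow Maven artifact installed, and reuses the gnomAD path from the report above, so it can't run standalone):

```python
# Sketch of the corrected repro (requires a Spark cluster with Glow installed).
import glow

# register() returns a new session with Glow's functions and the BGZF
# codec wired in; assigning the result back to `spark` is the key step
# that glow.register(spark) on its own misses.
spark = glow.register(spark)

input_vcf_path = 'wasb://gnomad@azureopendatastorage.blob.core.windows.net/release/3.1/vcf/genomes/gnomad.genomes.v3.1.sites.chr21.vcf.bgz'
vcf_df = spark.read.format("vcf").load(input_vcf_path)
```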

jakub-auger commented 1 year ago

Thanks, don't know how I missed it!

williambrandler commented 1 year ago

It often takes another pair of eyes to catch subtle things like this!