projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Cannot write INFO fields with LongType to VCF #525

Closed Hoeze closed 3 months ago

Hoeze commented 1 year ago

Example:

import pyspark.sql.functions as f
import pyspark.sql.types as t

(
    spark.read.parquet(INPUT_PATH)
    .select(
        f.col("chrom").alias("contigName"),
        f.col("start"),
        f.col("end"),
        f.col("ref").alias("referenceAllele"),
        f.array(f.col("alt")).alias("alternateAlleles"),
        f.col("INFO_SVTYPE"),
        f.col("INFO_END").astype(t.LongType()),
    )
    .write
    .format("vcf")
    .save(OUTPUT_PATH, mode="overwrite")
)

Fails with:

23/02/01 16:25:19 ERROR Executor: Exception in task 14.0 in stage 30.0 (TID 2513)
scala.MatchError: LongType (of class org.apache.spark.sql.types.LongType$)
    at io.projectglow.vcf.VCFSchemaInferrer$.vcfDataType(VCFSchemaInferrer.scala:181)
    at io.projectglow.vcf.VCFSchemaInferrer$.$anonfun$headerLinesFromSchema$2(VCFSchemaInferrer.scala:118)
    at scala.collection.immutable.List.map(List.scala:297)
    at io.projectglow.vcf.VCFSchemaInferrer$.headerLinesFromSchema(VCFSchemaInferrer.scala:116)
    at io.projectglow.vcf.VCFHeaderUtils$.parseHeaderLinesAndSamples(VCFHeaderUtils.scala:74)
    at io.projectglow.vcf.VCFOutputWriterFactory.newInstance(VCFFileFormat.scala:504)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:290)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

williambrandler commented 1 year ago

Hi @Hoeze

I had the same issue: the VCF writer does not support LongType() for INFO fields. The workaround is to cast LongType() INFO fields to IntegerType().

e.g.

from pyspark.sql.types import IntegerType
import pyspark.sql.functions as fx

# Cast the LongType INFO field to IntegerType so the VCF writer accepts it
vcf_df = vcf_df.withColumn("INFO_test", fx.col("INFO_test").cast(IntegerType()))
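
If several INFO fields are affected, the same cast can be applied in a loop. The following is only a sketch of that idea, not part of Glow: the helper names `long_info_columns` and `cast_long_info_to_int` are made up for illustration, and it assumes INFO columns follow the `INFO_` prefix convention used in the example above. It relies on `DataFrame.dtypes`, which reports LongType columns as "bigint".

```python
def long_info_columns(dtypes):
    """Return the names of INFO_* columns whose Spark type is bigint (LongType).

    `dtypes` is a list of (column_name, type_string) pairs, i.e. the shape
    returned by DataFrame.dtypes. Hypothetical helper, not a Glow API.
    """
    return [
        name
        for name, dtype in dtypes
        if name.startswith("INFO_") and dtype == "bigint"
    ]


def cast_long_info_to_int(df):
    """Cast every LongType INFO_* column of `df` to IntegerType.

    Columns without the INFO_ prefix (e.g. start, end) are left untouched.
    Hypothetical helper, not a Glow API.
    """
    # Imported lazily so long_info_columns can be used without Spark installed.
    import pyspark.sql.functions as fx
    from pyspark.sql.types import IntegerType

    for name in long_info_columns(df.dtypes):
        df = df.withColumn(name, fx.col(name).cast(IntegerType()))
    return df
```

Usage would be `vcf_df = cast_long_info_to_int(vcf_df)` before calling `.write.format("vcf")`. Values outside the 32-bit signed integer range will not survive the cast, so it is worth checking the column's minimum and maximum first.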

henrydavidge commented 5 months ago

I'll see if I can fix this.