projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Regular updates to work with cloud providers #534

Closed doit-mattporter closed 5 months ago

doit-mattporter commented 10 months ago

Are there plans to regularly update Glow so it stays compatible with the latest versions of PySpark on AWS EMR and GCP Dataproc? Having the most up-to-date Glow version restricted to spark3_2.12 is pretty limiting. While it currently works on the latest version of Dataproc, on several recent AWS EMR image versions Glow fails to run any DataFrame operation once it has read in the VCF. The error thrown is always the following:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.filePath()Ljava/lang/String;
        at io.projectglow.vcf.VCFFileFormat.$anonfun$buildReader$8(VCFFileFormat.scala:175)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:348)
...
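
A `NoSuchMethodError` like the one above generally points to a binary mismatch: the JAR was compiled against one version of a class and a different version is on the cluster's classpath. As a rough sketch of how one might catch that before reading a VCF, the helper below compares the cluster's Spark major.minor version with the version a JAR was built against. The function names and the `"3.3.0"` value in the usage note are hypothetical placeholders, not part of Glow or taken from its build files.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a Spark version string such as '3.4.1-amzn-0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def matches_build(cluster_version: str, built_against: str) -> bool:
    """Return True when the cluster's Spark major.minor equals the build's.

    Hypothetical helper: `built_against` must be supplied by the caller,
    for example from the release notes of the artifact being used.
    """
    return parse_major_minor(cluster_version) == parse_major_minor(built_against)
```

In a PySpark session this could be called as `matches_build(spark.version, "3.3.0")`, where `"3.3.0"` stands in for whatever Spark version the installed artifact actually targets. A `False` result does not prove the JAR will fail, but it is a cheap warning sign before running DataFrame operations.
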
henrydavidge commented 5 months ago

@doit-mattporter I've merged support for Spark 3.4 and 3.5. We'll release new artifacts soon. Feel free to build artifacts from source if you'd care to try it before then.

The next release will also include a Scala 2.13 binary.