projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Installing and running glow outside of databricks #494

Closed edg1983 closed 5 months ago

edg1983 commented 2 years ago

Hi,

My setup is Spark v3.1.3 with Hadoop v3.2.2, PySpark v3.1.2, and the latest glow.py.

I'm trying to read some BGEN files into Glow from a Jupyter notebook. I can read the input file using the suggested command when no BGI index is present:

variants_df = spark.read.format("bgen").load(input_bgen)

I then generated BGI indexes for my BGEN files using the bgenix tool v1.1.7, since the presence of a BGI index is supposed to improve import performance. However, if I try to load a BGEN file for which I have the index, the process fails with the following error related to reading the BGI index file:

org.skife.jdbi.v2.exceptions.UnableToObtainConnectionException: java.sql.SQLException: No suitable driver found for jdbc:sqlite:/tmp/bgen_indices/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.stdChrs.bgen.bgi
    at org.skife.jdbi.v2.DBI.open(DBI.java:230)
    at io.projectglow.bgen.BgenFileFormat.nextVariantIndex(BgenFileFormat.scala:150)
    at io.projectglow.bgen.BgenFileFormat.$anonfun$buildReader$4(BgenFileFormat.scala:94)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: No suitable driver found for jdbc:sqlite:/tmp/bgen_indices/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.stdChrs.bgen.bgi
    at java.sql.DriverManager.getConnection(DriverManager.java:689)
    at java.sql.DriverManager.getConnection(DriverManager.java:270)
    at org.skife.jdbi.v2.DBI$1.openConnection(DBI.java:103)
    at org.skife.jdbi.v2.DBI.open(DBI.java:212)
    ... 24 more

How can I fix this?

Thanks!

williambrandler commented 2 years ago

Hey @edg1983, reading BGEN files that are indexed with bgenix should work fine on cloud object storage. You do not need to read the index itself, just the BGEN file as you did before, but make sure the index is on the same path.

Please provide more details about the environment you are running Glow in.

edg1983 commented 2 years ago

Hi, I'm testing Glow on a local stand-alone Spark installation (in particular, we are interested in the GWAS pipeline) and everything else has worked fine so far. Essentially, I initialize a SparkSession with PySpark using local[24] as master and additional packages for Delta and Glow, as in the sketch below.
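
For context, the initialization looks roughly like this (a minimal sketch; the package coordinates and versions are assumptions for Spark 3.1 / Scala 2.12 and may need adjusting):

import glow
from pyspark.sql import SparkSession

# Sketch of the session setup; glow-spark3 1.1.x and delta-core 1.0.x
# are assumed versions matching Spark 3.1 / Scala 2.12.
spark = (
    SparkSession.builder
    .master("local[24]")
    .config("spark.jars.packages",
            "io.projectglow:glow-spark3_2.12:1.1.2,"
            "io.delta:delta-core_2.12:1.0.1")
    .config("spark.hadoop.io.compression.codecs",
            "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
spark = glow.register(spark)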

I'm reading the BGEN file directly as you suggested, but when a BGI index is present I get the error reported above, which suggests there is some issue reading the index file.

williambrandler commented 2 years ago

Ah ok, thanks.

Here is the offending line of code: https://github.com/projectglow/glow/blob/8b0bcd6b2f7320c3a5bd186bdcfa4707af303b47/core/src/main/scala/io/projectglow/bgen/BgenFileFormat.scala#L149

It uses SQLite to access the index, but it cannot find the SQLite classes and driver on the classpath. Do you have the sqlite-jdbc jar on your stand-alone installation of Spark? Here is a similar issue on Stack Overflow:

https://stackoverflow.com/questions/16725377/no-suitable-driver-found-sqlite

williambrandler commented 2 years ago

@edg1983 were you able to resolve this?

Glow depends on sqlite-jdbc 3.20.1
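
If the driver jar is missing from your stand-alone Spark classpath, something like this should pull it in when you create the session (a sketch; org.xerial:sqlite-jdbc is the artifact Glow builds against, the Glow coordinate is assumed):

from pyspark.sql import SparkSession

# Sketch: add the xerial sqlite-jdbc driver next to Glow on the classpath
# so the BGEN reader can open the BGI (SQLite) index.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "io.projectglow:glow-spark3_2.12:1.1.2,"
            "org.xerial:sqlite-jdbc:3.20.1")
    .getOrCreate()
)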

edg1983 commented 2 years ago

Hi! Apologies for the late reply... In the end I built a container with all the Spark and Python dependencies, and it works now!

Thanks!

williambrandler commented 2 years ago

thanks @edg1983,

Can we work together to contribute this container back to Glow?

It should be straightforward, as we already have a container for running Glow in Databricks and a Docker Hub subscription.

This would benefit the community, as we have had requests to make it easier to run Glow in a container.

edg1983 commented 2 years ago

Hi, our main interest is using Glow to run the regenie GWAS algorithm at scale using the Spark implementation provided in the GloWGR pipeline. So I've made a container based on the Data Mechanics Docker image for Spark (gcr.io/datamechanics/spark:3.1.2-hadoop-3.2.0-java-11-scala-2.12-python-3.8-dm16), extended with additional jar and Python dependencies for Glow.

The idea is to use this Docker image to deploy the system at scale on Kubernetes, so that we can adapt easily to local runs on our HPC as well as cloud runs on UKBB RAP or other cloud platforms. The image I've optimized so far is available on Docker Hub as edg1983/glowgr-spark:v1 and packages Spark and Glow. Essentially, you can run any Python script containing a Glow analysis in stand-alone mode using something like the following command (it's Singularity because we cannot run Docker on our HPC, but it uses the same image), with a test.py script that then initializes the Spark config:

singularity run \
    --bind /your/output/path:/output \
    --bind /your/input/path:/input \
    --bind /path/to/python/script:/opt/application \
    --bind /path/to/tmp/dir:/spark_tmp \
    glowgr-spark_v1.sif \
    driver --driver-memory 120G \
    local:///opt/application/test.py

It has worked fine in my tests so far, and we are now working to make it run on Kubernetes. Feel free to test it more and let me know if this is of interest.
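
For reference, the kind of test.py the command drives looks roughly like this (a minimal sketch; the /input path comes from the bind mounts above, and the file name is hypothetical):

import glow
from pyspark.sql import SparkSession

# Sketch of a driver script run inside the container; master and memory
# are supplied by the container entrypoint, not hard-coded here.
spark = SparkSession.builder.appName("glow-test").getOrCreate()
spark = glow.register(spark)

# /input is bind-mounted by the singularity command above.
df = spark.read.format("bgen").load("/input/test.bgen")
df.limit(5).show()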

williambrandler commented 2 years ago

This is great, thanks. I would like to translate this into something anyone can use.

Do you have the Dockerfiles in a repo that I could look at, to see if we can contribute them back to Glow?

Thanks

edg1983 commented 2 years ago

This is the Dockerfile I'm using right now. Feel free to improve and/or redistribute it, as long as my contribution is properly acknowledged. Dockerfile_glow.zip

williambrandler commented 2 years ago

Thanks @edg1983, I'm working on a container here: https://github.com/projectglow/glow/pull/503

Could you please test projectglow/open-source-glow:1.1.2 (https://hub.docker.com/r/projectglow/open-source-glow/tags) to see if it works the same as your container?

I've acknowledged you in the documentation.

Thanks!

henrydavidge commented 5 months ago

Closing this, as it seems the issue was resolved.