Closed: edg1983 closed this issue 9 months ago
Hey @edg1983, reading BGENs that are indexed with bgenix should work fine on cloud object storage. You do not need to read the index itself; just read the BGEN as you did before, but make sure the index is at the same path.
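To make the "index on the same path" convention concrete, here is a minimal sketch. As far as I understand Glow's behavior, for a file `<path>.bgen` it looks for the bgenix index at `<path>.bgen.bgi` alongside the data file; the bucket and file names below are hypothetical.

```python
# Hypothetical paths -- the point is only the sibling-path convention.
bgen_path = "s3://my-bucket/genotypes/chr21.bgen"
index_path = bgen_path + ".bgi"  # the index must exist at this sibling path

# Assuming `spark` is a SparkSession with Glow registered:
# df = spark.read.format("bgen").load(bgen_path)
# df.select("contigName", "start", "names").show(5)
```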
Please provide more details about the environment you are running Glow in.
Hi, I'm testing Glow on a local standalone Spark deployment (we are particularly interested in the GWAS pipeline), and everything else has worked fine so far. Essentially, I initialize a SparkSession with PySpark using local[24] as the master and additional packages for Delta and Glow.
I'm reading the BGEN file directly as you suggested, but when a BGI index is present I get the error reported above, which suggests there is some issue reading the index file.
ah ok thanks,
Here is the offending line of code: https://github.com/projectglow/glow/blob/8b0bcd6b2f7320c3a5bd186bdcfa4707af303b47/core/src/main/scala/io/projectglow/bgen/BgenFileFormat.scala#L149
It uses SQLite to access the index, but cannot find the SQLite classes and driver on the classpath. Do you have the sqlite-jdbc jar on your standalone Spark deployment? Here is a similar issue on Stack Overflow:
https://stackoverflow.com/questions/16725377/no-suitable-driver-found-sqlite
@edg1983 were you able to resolve this?
Glow depends on sqlite-jdbc 3.20.1
Hi! Apologies for the late reply... In the end I built a container with all the Spark and Python dependencies, and it works now!
Thanks!
thanks @edg1983,
Can we work together to contribute this container back to Glow?
It should be straightforward, as we already have a container for running Glow in Databricks and a Docker Hub subscription.
This would benefit the community, as we have had requests to make it easier to run Glow in a container.
Hi, our main interest is using Glow to run the regenie GWAS algorithm at scale, using the Spark implementation provided in the GloWGR pipeline. So I've made a container based on the datamechanics Docker image for Spark (gcr.io/datamechanics/spark:3.1.2-hadoop-3.2.0-java-11-scala-2.12-python-3.8-dm16), extended with the additional jar and Python dependencies for Glow.
The idea is to use this Docker image to deploy the system at scale on Kubernetes, so that we can easily adapt it for local runs on our HPC as well as cloud runs on the UKBB RAP or other cloud platforms.
The image I've optimized so far is available on Docker Hub as edg1983/glowgr-spark:v1 and packs Spark and Glow. Essentially, you can run any Python script containing a Glow analysis in standalone mode using something like the following command (it's Singularity because we cannot run Docker on our HPC, but it uses the same image), where test.py is a script that initializes the Spark config.
```shell
singularity run \
  --bind /your/output/path:/output \
  --bind /your/input/path:/input \
  --bind /path/to/python/script:/opt/application \
  --bind /path/to/tmp/dir:/spark_tmp \
  glowgr-spark_v1.sif \
  driver --driver-memory 120G \
  local:///opt/application/test.py
```
It has worked fine in my tests so far, and we are now working to make it run on Kubernetes. Feel free to test it further and let me know if this is of interest.
This is great, thanks. We'd like to translate this into something anyone can use.
Do you have the Dockerfiles in a repo that I could look at, to see if we can contribute them back to Glow?
Thanks
This is the Dockerfile I'm using right now. Feel free to improve and/or redistribute it, as long as my contribution is properly acknowledged. Dockerfile_glow.zip
Thanks @edg1983, working on a container here: https://github.com/projectglow/glow/pull/503
Could you please test projectglow/open-source-glow:1.1.2
https://hub.docker.com/r/projectglow/open-source-glow/tags
to see if it works the same as your container?
I've acknowledged you in the documentation.
Thanks!
Closing this; it seems the issue was resolved.
Hi,
My setup is Spark v3.1.3 with Hadoop v3.2.2 and PySpark v3.1.2 and latest glow.py.
I'm trying to read some BGEN files with Glow from a Jupyter notebook. I can read the input file using the suggested command when no BGI index is present.
I've then generated a BGI index for my BGEN files using the bgenix tool v1.1.7, since the presence of a BGI index is supposed to improve import performance. However, if I try to load a BGEN file for which I have the index, the process fails with the following error related to reading the BGI index file.
How can I fix this?
Thanks!