Closed — dridk closed this issue 1 year ago
Are you running locally or in the cloud?
Try starting the Spark shell with the Glow Maven package:
https://glow.readthedocs.io/en/latest/getting-started.html#running-locally
Locally. I cannot run the following command because I do not have access to bash; I only have a Jupyter notebook.
./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
An example of Glow running on Google Colab would be helpful.
can you execute terminal commands from your jupyter notebook? https://stackoverflow.com/questions/38694081/executing-terminal-commands-in-jupyter-notebook
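Besides the `!command` cell magic mentioned in that Stack Overflow answer, a portable way to run a terminal command from a notebook cell is the standard library's `subprocess` module. A minimal sketch (the `echo` command here is just a placeholder for whatever shell command you need):

```python
import subprocess

# Run a shell command from a notebook cell and capture its output.
# Replace the argument list with the command you actually need to run.
result = subprocess.run(
    ["echo", "hello from the shell"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Passing the command as a list of arguments (rather than a single string with `shell=True`) avoids shell-quoting issues.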
@dridk You can use this snippet:
import glow
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("your_app")
    .config("spark.jars.packages", ",".join([
        "io.projectglow:glow-spark3_2.12:1.2.1",
    ]))
    # .config("spark.local.dir", os.environ.get("TEMPDIR"))
    # .config("spark.sql.caseSensitive", "true")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
glow.register(spark)
spark
Make sure your PySpark version is v3.2.x in this case :)
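For clarity on what the `spark.jars.packages` config in the snippet above expects: its value is just a comma-separated list of Maven coordinates, the same string you would pass to the `--packages` flag on the command line. A small sketch of how that string is assembled:

```python
# spark.jars.packages takes a comma-separated list of Maven coordinates
# in group:artifact:version form -- identical to the --packages flag.
packages = [
    "io.projectglow:glow-spark3_2.12:1.2.1",
    # append further coordinates here if you need more dependencies
]
packages_conf = ",".join(packages)
print(packages_conf)
```

Note that this config only takes effect if it is set before the first SparkSession is created; setting it on an already-running session has no effect.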
Thank you! I will try it and let you know if it works!
Hi,
I am new to PySpark and I would like to try Glow to load a VCF file. I can use PySpark without starting Spark manually from the command line; it seems to start automatically from Python. As far as I know, all the Spark binaries come with the pyspark pip installation.
So, if I run the following code:
it returns the following error:
I suppose I have to start Spark with the following options, but I have no idea how to add these package dependencies to a SparkSession from Python code:
--packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
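Besides setting `spark.jars.packages` on the builder, another way to pass these exact command-line options from Python is the `PYSPARK_SUBMIT_ARGS` environment variable, which pyspark reads when the JVM starts. A sketch, assuming it is set before the first SparkSession is created (the trailing `pyspark-shell` token is required):

```python
import os

# Must be set BEFORE any SparkSession/SparkContext is created;
# pyspark reads this variable once, at JVM startup.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.projectglow:glow-spark3_2.12:1.2.1 "
    "--conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec "
    "pyspark-shell"
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

In a notebook this cell would have to run before any other cell that touches Spark; otherwise the options are silently ignored.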