jdenisgiguere opened this issue 4 years ago
My initial reaction would be that there is something wrong with the Python environment in the container. Can you give any detail about the test failures that result?
This is the error log: https://gist.github.com/jdenisgiguere/6ba4f28bbf88a61a693932b9fe251b6a
Several of these look like problems in the build. Error messages like this:

```
E  py4j.protocol.Py4JJavaError: An error occurred while calling o1043.load.
E  : java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.rasterframes.experimental.datasource.awspds.L8CatalogDataSource$
E  at org.locationtech.rasterframes.experimental.datasource.awspds.L8CatalogDataSource.shortName(L8CatalogDataSource.scala:38)
```

indicate that the assembly jar does not have the `experimental` module built into it. The usual fix for that is something like `sbt clean; sbt package`, which should recreate the assembly jar with the `experimental` module.
Others like this also indicate problems getting the right jars on the PySpark classpath: https://gist.github.com/jdenisgiguere/6ba4f28bbf88a61a693932b9fe251b6a#file-gistfile1-txt-L135-L136. Conceptually, to fix this you would modify the `spark_test_session` method in `pyrasterframes/src/main/python/tests/__init__.py` to update `spark.jars` with additional jars.
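A minimal sketch of that change, assuming the test session is built via `SparkSession.builder` and that `spark.jars` takes its standard form of a comma-separated list of local jar paths. The helper name and the jar paths below are hypothetical, not from the rasterframes codebase:

```python
def extra_jars_conf(assembly_jar, extra_jars):
    """Build the value for Spark's `spark.jars` option.

    Spark expects a single comma-separated string of jar paths,
    not a Python list, so we join them here.
    """
    return ",".join([assembly_jar] + list(extra_jars))


# In spark_test_session, this value could then be passed as:
#   SparkSession.builder.config("spark.jars", value)
# (paths below are illustrative)
value = extra_jars_conf(
    "pyrasterframes/target/scala-2.11/pyrasterframes-assembly.jar",
    ["/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.3.jar"],
)
```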
Thank you @vpipkt !
My first attempts to use `spark.jars` instead of `SPARK_DIST_CLASSPATH` to specify the Hadoop configuration were not successful. I am trying to compile Spark 2.4.4 with Hadoop 3.1.3 instead. This is not ideal, but it could be a workaround.
I succeeded in running the tests with a custom Spark build with Hadoop 3.1. So, currently, we cannot run the tests with a Hadoop-free build of Spark plus `SPARK_DIST_CLASSPATH`, but they pass with a custom build of Spark 2.4.4 against a Hadoop version other than 2.7.
@vpipkt, I'll let you close this if you find the workaround satisfactory.
I would like to check whether pyrasterframes is compatible with certain versions of Hadoop. My primary concern is S3 integration through MinIO.
My first attempt is to set up Spark + Hadoop with environment variables and launch the pyrasterframes tests with:

```
sbt pyTest
```

This results in 66 failed tests.
You can find the setup at https://github.com/jdenisgiguere/rasterframes-minio-ZazJXB4U/tree/master/pyrasterframes0.9-unittests-hadoop3.1.3
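The environment-variable wiring follows the usual "Hadoop-free" Spark pattern: Spark is pointed at a separately installed Hadoop by exporting that Hadoop's classpath. The paths below are illustrative; the actual values are in the linked repository:

```shell
# Hadoop-free Spark setup: Spark finds Hadoop classes via
# SPARK_DIST_CLASSPATH, populated from `hadoop classpath`.
export HADOOP_HOME=/opt/hadoop-3.1.3
export SPARK_HOME=/opt/spark-2.4.4-bin-without-hadoop
export SPARK_DIST_CLASSPATH="$("$HADOOP_HOME/bin/hadoop" classpath)"
```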
Is it possible to run pyrasterframes against a custom Hadoop version? If so, how should it be done?