locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

How to test pyrasterframes with a custom Hadoop version? #455

Open jdenisgiguere opened 4 years ago

jdenisgiguere commented 4 years ago

I would like to check whether pyrasterframes is compatible with certain versions of Hadoop. My primary concern is the S3 integration through MinIO.
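
For context, S3 integration with MinIO usually comes down to pointing the s3a connector at the MinIO endpoint. Below is a minimal sketch of such a session, assuming hadoop-aws and its AWS SDK are on the classpath; the endpoint and credentials are hypothetical:

    from pyspark.sql import SparkSession

    # Hypothetical MinIO endpoint and credentials, for illustration only.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("minio-s3a-check")
             .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
             .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
             .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
             .config("spark.hadoop.fs.s3a.path.style.access", "true")
             .getOrCreate())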

My first attempt was to set up Spark + Hadoop with environment variables and launch the pyrasterframes tests with: sbt pyTest
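
As a sketch of that setup (the install paths are hypothetical), a "Hadoop-free" Spark build is typically pointed at a separately installed Hadoop by populating SPARK_DIST_CLASSPATH from the hadoop classpath command:

    import os
    import subprocess

    # Hypothetical path to a Spark build that ships without Hadoop jars.
    os.environ["SPARK_HOME"] = "/opt/spark-2.4.4-bin-without-hadoop"
    # The conventional way to fill SPARK_DIST_CLASSPATH is `hadoop classpath`,
    # which prints the classpath of the local Hadoop installation.
    os.environ["SPARK_DIST_CLASSPATH"] = subprocess.check_output(
        ["hadoop", "classpath"], text=True).strip()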

This results in 66 failed tests.

You can find the setup in https://github.com/jdenisgiguere/rasterframes-minio-ZazJXB4U/tree/master/pyrasterframes0.9-unittests-hadoop3.1.3

Should it be possible to run pyrasterframes against a custom Hadoop version? If so, how should it be done?

vpipkt commented 4 years ago

My initial reaction is that there may be something wrong with the Python environment in the container. Can you give any details about some of the test failures that result?

jdenisgiguere commented 4 years ago

This is the error log: https://gist.github.com/jdenisgiguere/6ba4f28bbf88a61a693932b9fe251b6a

vpipkt commented 4 years ago

Several of these look like problems in the build. Error messages like this:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1043.load.
E                   : java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.rasterframes.experimental.datasource.awspds.L8CatalogDataSource$
E                       at org.locationtech.rasterframes.experimental.datasource.awspds.L8CatalogDataSource.shortName(L8CatalogDataSource.scala:38) 

indicate that the assembly jar does not have the experimental module built into it. The usual fix is something like sbt clean; sbt package, which should recreate the assembly jar with the experimental module included.

Others, like the ones at https://gist.github.com/jdenisgiguere/6ba4f28bbf88a61a693932b9fe251b6a#file-gistfile1-txt-L135-L136, also indicate problems getting the right jars onto the PySpark classpath. Conceptually, to fix this you would modify the spark_test_session method in pyrasterframes/src/main/python/tests/__init__.py to add the extra jars to spark.jars.
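
Here is a minimal sketch of that idea, not the project's actual test harness; the jar paths below are hypothetical and the real spark_test_session helper may differ:

    from pyspark.sql import SparkSession

    # Hypothetical extra jars needed for a custom Hadoop/S3 setup.
    extra_jars = ",".join([
        "/opt/jars/hadoop-aws-3.1.3.jar",
        "/opt/jars/aws-java-sdk-bundle-1.11.271.jar",
    ])

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pyrasterframes-tests")
             # spark.jars takes a comma-separated list of jars to ship to the JVM.
             .config("spark.jars", extra_jars)
             .getOrCreate())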

jdenisgiguere commented 4 years ago

Thank you @vpipkt! My first attempts to use spark.jars instead of SPARK_DIST_CLASSPATH to specify the Hadoop configuration were not successful. I am trying to compile Spark 2.4.4 with Hadoop 3.1.3 instead. This is not ideal but could be a workaround.

jdenisgiguere commented 4 years ago

I succeeded in running the tests with a custom Spark build that includes Hadoop 3.1.

So, currently, we cannot run the tests with a "Hadoop-free" Spark build plus SPARK_DIST_CLASSPATH, but they do pass with a custom build of Spark 2.4.4 compiled against a Hadoop version other than 2.7.
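
For anyone reproducing this, one quick way to confirm which Hadoop version a PySpark session actually picked up (a sketch that reaches into the Py4J gateway, so it relies on Spark internals):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # Ask the JVM-side Hadoop libraries for their version string, e.g. "3.1.3".
    jvm = spark.sparkContext._jvm
    print(jvm.org.apache.hadoop.util.VersionInfo.getVersion())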

@vpipkt, I'll let you close this issue if you find the workaround satisfactory.