locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0
243 stars 46 forks

Can't load saved model: ImportError: No module named 'astraea' #83

Open rbavery opened 5 years ago

rbavery commented 5 years ago

I can save a model to S3, but I can't load it back with the following:

from pyspark.ml import PipelineModel
model.save("s3://activemapper/test.model")
model_load = PipelineModel.load("s3://activemapper/test.model")

This Stack Overflow answer indicates this is the way to load and reuse saved models: https://stackoverflow.com/questions/50587744/how-to-save-the-model-after-doing-pipeline-fit
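For context on why loading fails even though saving works: pyspark's ML wrapper maps each saved JVM stage class into a Python class by a simple string substitution before importing it. The sketch below replicates that mapping in plain Python (adapted from the `pyspark/ml/wrapper.py` lines shown in the traceback; the `TileExploder` class name is an illustrative RasterFrames example, not taken from this model):

```python
# Sketch of the name mapping pyspark's ml wrapper performs when
# deserializing a pipeline stage (see pyspark/ml/wrapper.py __get_class).
# Stage classes under org.apache.spark map onto pyspark modules; anything
# else keeps its JVM package name, and Python then attempts
# __import__("astraea...."), which raises ImportError unless a matching
# Python package is importable on the driver.

def python_class_name(java_class_name):
    """Mirror the replace() pyspark applies before importing a stage."""
    return java_class_name.replace("org.apache.spark", "pyspark")

# A built-in stage maps cleanly into pyspark's namespace:
print(python_class_name("org.apache.spark.ml.PipelineModel"))
# → pyspark.ml.PipelineModel

# A third-party stage (illustrative RasterFrames class name) does not:
print(python_class_name("astraea.spark.rasterframes.ml.TileExploder"))
# → astraea.spark.rasterframes.ml.TileExploder
```

This is why the failure surfaces only on load: saving serializes the JVM objects directly, but loading has to reconstruct Python wrappers, and that is where the `astraea` import is attempted.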

The error is

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-15-a0eb67c190e1> in <module>()
----> 1 model_load = PipelineModel.load("s3://activemapper/test.model")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(cls, path)
    255     def load(cls, path):
    256         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 257         return cls.read().load(path)
    258 
    259 

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(self, path)
    199             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
    200                                       % self._clazz)
--> 201         return self._clazz._from_java(java_obj)
    202 
    203     def context(self, sqlContext):

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in <listcomp>(.0)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in _from_java(java_stage)
    200         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
    201         # Generate a default new instance from the stage_name class.
--> 202         py_type = __get_class(stage_name)
    203         if issubclass(py_type, JavaParams):
    204             # Load information from java_stage to the instance.

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in __get_class(clazz)
    194             parts = clazz.split('.')
    195             module = ".".join(parts[:-1])
--> 196             m = __import__(module)
    197             for comp in parts[1:]:
    198                 m = getattr(m, comp)
ImportError: No module named 'astraea'

pyspark raises this error higher in the chain: NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"). Any help is appreciated!

metasim commented 5 years ago

@rbavery Sorry for the delay in responding to this. Can you confirm that before you tried to load the model back in that you've called the withRasterFrames() method on the SparkSession? My initial guess is that it's behaving as if the JVM library isn't loaded.

Are you willing to share the JSON component of the saved workflow?

rbavery commented 5 years ago

Hi @metasim thanks for the reply. I definitely called .withRasterFrames, here is what that section looks like

conf = gps \
    .geopyspark_conf(appName="CVML Model Iteration", master="yarn") \
    .set("spark.dynamicAllocation.enabled", True) \
    .set("spark.ui.enabled", True) \
    .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate() \
    .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")

Attached is the saved workflow model_pipeline.zip

metasim commented 5 years ago

@rbavery Yeh, the saved workflow looks fine. Have you tried loading it in local mode? I'm wondering if it's a YARN deployment thing.... the requisite jar and/or zip not getting delivered properly.

I'm inclined to want to ask @jbouffard on Gitter if he has seen this error before, as he's the person that seems to have done the most with pyrasterframes outside of my organization (where we may have inadvertently made assumptions about runtime environment).

rbavery commented 5 years ago

@metasim haven't tried local mode, would I set it like this, keeping everything else constant?

conf = gps \
    .geopyspark_conf(appName="CVML Model Iteration", master="local") \
    .set("spark.dynamicAllocation.enabled", True) \
    .set("spark.ui.enabled", True) \
    .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate() \
    .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")

Thanks for the help, I appreciate it.

rbavery commented 5 years ago

@metasim Heads up: I got a slightly different error when trying the above, just switching master="yarn" to master="local"

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-2113b4b77176> in <module>()
      1 from pyspark.ml import PipelineModel
----> 2 model_load = PipelineModel.load("s3://activemapper/test.model")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(cls, path)
    255     def load(cls, path):
    256         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 257         return cls.read().load(path)
    258 
    259 

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(self, path)
    199             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
    200                                       % self._clazz)
--> 201         return self._clazz._from_java(java_obj)
    202 
    203     def context(self, sqlContext):

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in <listcomp>(.0)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in _from_java(java_stage)
    200         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
    201         # Generate a default new instance from the stage_name class.
--> 202         py_type = __get_class(stage_name)
    203         if issubclass(py_type, JavaParams):
    204             # Load information from java_stage to the instance.

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in __get_class(clazz)
    194             parts = clazz.split('.')
    195             module = ".".join(parts[:-1])
--> 196             m = __import__(module)
    197             for comp in parts[1:]:
    198                 m = getattr(m, comp)

ImportError: No module named 'astraea.spark'

metasim commented 5 years ago

@rbavery I'm sorry this continues to be a problem. Would you be able to run the following before you try to load the model and confirm that there's a rasterframes jar file in the classpath?

spark._jvm.java.lang.System.getProperty('java.class.path')

If not, I'd need to know more about the differences between your training and scoring environments to determine why RasterFrames was in the classpath in the former case but not the latter.
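As a quick way to check the property value once you have it, a small helper (hypothetical, not part of pyspark) can scan the returned classpath string for RasterFrames entries rather than eyeballing the whole dump. The `sample` string below is illustrative; in a live session you would pass the result of the `getProperty` call above:

```python
# Hypothetical helper: given the (colon-separated, on Linux) string from
# spark._jvm.java.lang.System.getProperty('java.class.path'), list any
# entries that look like RasterFrames jars.

def find_jars(classpath, needle="rasterframes"):
    """Return classpath entries whose path contains `needle`."""
    return [entry for entry in classpath.split(":") if needle in entry.lower()]

sample = ("/usr/lib/spark/jars/spark-core_2.11-2.2.1.jar:"
          "/opt/jars/rasterframes_2.11-0.7.3.jar")
print(find_jars(sample))
# → ['/opt/jars/rasterframes_2.11-0.7.3.jar']
```

An empty result on the driver's classpath would match the ImportError symptom.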

rbavery commented 5 years ago

Looks like there is not. I ran the following

probability_images = 3 # Write this many probability images to location specified in YAML file
seed = 42 #Use this seed for sampling probability images
complete_catalog = False #Produce the entire catalog of probability images
s3_bucket = 'activemapper'
run_id = 1
aoi_name = 'ghana-aoi4-test'

start_time = time.time()
conf = gps\
       .geopyspark_conf(appName="CVML Model Iteration", master="local") \
       .set("spark.dynamicAllocation.enabled", True) \
       .set("spark.ui.enabled", True) \
       .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession\
        .builder\
        .config(conf=conf)\
        .getOrCreate()\
        .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")
spark._jvm.java.lang.System.getProperty('java.class.path')

and got

'/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar:/usr/lib/hadoop-lzo/lib/hadoop-lzo-0.4.19.jar:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codedeploy-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ssm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudformation-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-xray-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servicediscovery-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mobile-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-polly-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servermigration-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-devicefarm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudhsmv2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-test-utils-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloud9-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workmail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workspaces-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-emr-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-appstream-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-machinelearning-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-organizations-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-rekognition-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-discovery-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-api-gateway-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-appsync-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lightsail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudtrail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kinesis-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-acm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codestar-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticache-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-route53-1.11.267.jar:/usr/share/aws/aws-java-sdk/
aws-java-sdk-elasticloadbalancingv2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-support-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudsearch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudhsm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-directconnect-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codecommit-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iot-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-rds-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-storagegateway-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kinesisvideo-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-redshift-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iotjobsdataplane-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mq-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lambda-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-s3-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-shield-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-resourcegroups-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ecr-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-costandusagereport-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-efs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codegen-maven-plugin-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-budgets-1.11.267.jar:/usr/share/aws/aws-java-sdk/jmespath-java-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-pricing-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitoidentity-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-models-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-glacier-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-serverlessapplicationrepository-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-gamelift-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-directory-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-batch-1.11.267.jar:/usr/share/aws/aws-java-sdk
/aws-java-sdk-lex-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elastictranscoder-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediastore-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-health-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-comprehend-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iam-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-autoscaling-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ses-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-guardduty-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitoidp-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticsearch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-simpledb-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-config-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dynamodb-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sagemakerruntime-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplaceentitlement-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-waf-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-stepfunctions-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-greengrass-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-importexport-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kms-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-core-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-resourcegroupstaggingapi-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-athena-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-datapipeline-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sqs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-migrationhub-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudwatchmetrics-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sts-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplacecommerceanalytics-1.11.267.jar:/usr/share/aw
s/aws-java-sdk/aws-java-sdk-sagemaker-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-costexplorer-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mechanicalturkrequester-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediaconvert-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dms-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-snowball-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dax-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ec2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opsworks-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediastoredata-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-code-generator-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-logs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lexmodelbuilding-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticbeanstalk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-alexaforbusiness-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workdocs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opsworkscm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-simpleworkflow-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudwatch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplacemeteringservice-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-translate-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-clouddirectory-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-events-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediapackage-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-autoscalingplans-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codebuild-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-inspector-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sns-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servicecatalog-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-pi
npoint-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudfront-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opensdk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-applicationautoscaling-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ecs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-medialive-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitosync-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codepipeline-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticloadbalancing-1.11.267.jar:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/jcl-over-slf4j-1.7.21.jar:/usr/share/aws/emr/emrfs/lib/bcpkix-jdk15on-1.51.jar:/usr/share/aws/emr/emrfs/lib/jmespath-java-1.11.267.jar:/usr/share/aws/emr/emrfs/lib/ion-java-1.0.2.jar:/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.21.0.jar:/usr/share/aws/emr/emrfs/lib/bcprov-jdk15on-1.51.jar:/usr/share/aws/emr/emrfs/lib/javax.inject-1.jar:/usr/share/aws/emr/emrfs/lib/slf4j-api-1.7.21.jar:/usr/share/aws/emr/emrfs/lib/aopalliance-1.0.jar:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/lib/spark/conf/:/usr/lib/spark/jars/avro-1.7.7.jar:/usr/lib/spark/jars/htrace-core4-4.0.1-incubating.jar:/usr/lib/spark/jars/guice-3.0.jar:/usr/lib/spark/jars/jul-to-slf4j-1.7.16.jar:/usr/lib/spark/jars/hive-cli-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/spark-network-common_2.11-2.2.1.jar:/usr/lib/spark/jars/scala-library-2.11.8.jar:/usr/lib/spark/jars/spark-repl_2.11-2.2.1.jar:/usr/lib/spark/jars/gson-2.2.4.jar:/usr/lib/spark/jars/janino-3.0.0.jar:/usr/lib/spark/jars/commons-logging-1.1.3.jar:/usr/lib/spark/jars/hadoop-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-lang-2.6.jar:/usr/lib/spark/jars/RoaringBitmap-0.5.11.jar:/usr/lib/spark/
jars/ivy-2.4.0.jar:/usr/lib/spark/jars/machinist_2.11-0.6.1.jar:/usr/lib/spark/jars/libthrift-0.9.3.jar:/usr/lib/spark/jars/javax.ws.rs-api-2.0.1.jar:/usr/lib/spark/jars/metrics-core-3.1.2.jar:/usr/lib/spark/jars/pyrolite-4.13.jar:/usr/lib/spark/jars/jetty-6.1.26-emr.jar:/usr/lib/spark/jars/spark-network-shuffle_2.11-2.2.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/api-asn1-api-1.0.0-M20.jar:/usr/lib/spark/jars/jersey-common-2.22.2.jar:/usr/lib/spark/jars/py4j-0.10.4.jar:/usr/lib/spark/jars/activation-1.1.1.jar:/usr/lib/spark/jars/mail-1.4.7.jar:/usr/lib/spark/jars/jackson-module-paranamer-2.6.7.jar:/usr/lib/spark/jars/json-smart-1.1.1.jar:/usr/lib/spark/jars/xz-1.0.jar:/usr/lib/spark/jars/spark-sketch_2.11-2.2.1.jar:/usr/lib/spark/jars/slf4j-api-1.7.16.jar:/usr/lib/spark/jars/guava-14.0.1.jar:/usr/lib/spark/jars/json4s-jackson_2.11-3.2.11.jar:/usr/lib/spark/jars/univocity-parsers-2.2.1.jar:/usr/lib/spark/jars/core-1.1.2.jar:/usr/lib/spark/jars/jetty-util-6.1.26-emr.jar:/usr/lib/spark/jars/antlr-runtime-3.4.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-jobclient-2.8.3-amzn-0.jar:/usr/lib/spark/jars/jersey-media-jaxb-2.22.2.jar:/usr/lib/spark/jars/hadoop-yarn-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/objenesis-2.1.jar:/usr/lib/spark/jars/commons-dbcp-1.4.jar:/usr/lib/spark/jars/jodd-core-3.5.2.jar:/usr/lib/spark/jars/spark-unsafe_2.11-2.2.1.jar:/usr/lib/spark/jars/lz4-1.3.0.jar:/usr/lib/spark/jars/spark-hive-thriftserver_2.11-2.2.1.jar:/usr/lib/spark/jars/calcite-linq4j-1.2.0-incubating.jar:/usr/lib/spark/jars/json4s-ast_2.11-3.2.11.jar:/usr/lib/spark/jars/jackson-core-asl-1.9.13.jar:/usr/lib/spark/jars/spark-sql_2.11-2.2.1.jar:/usr/lib/spark/jars/parquet-common-1.8.2.jar:/usr/lib/spark/jars/jersey-client-2.22.2.jar:/usr/lib/spark/jars/ST4-4.0.4.jar:/usr/lib/spark/jars/oro-2.0.8.jar:/usr/lib/spark/jars/jdo-api-3.0.1.jar:/usr/lib/spark/jars/parquet-hadoop-1.8.2.jar:/usr/lib/spark/jars/scala-compiler-2.11.8.jar:/usr/
lib/spark/jars/jackson-mapper-asl-1.9.13.jar:/usr/lib/spark/jars/compress-lzf-1.0.3.jar:/usr/lib/spark/jars/metrics-ganglia-3.1.2.jar:/usr/lib/spark/jars/stax-api-1.0.1.jar:/usr/lib/spark/jars/xmlenc-0.52.jar:/usr/lib/spark/jars/avro-mapred-1.7.7-hadoop2.jar:/usr/lib/spark/jars/oncrpc-1.0.7.jar:/usr/lib/spark/jars/shapeless_2.11-2.3.2.jar:/usr/lib/spark/jars/eigenbase-properties-1.1.5.jar:/usr/lib/spark/jars/commons-collections-3.2.2.jar:/usr/lib/spark/jars/super-csv-2.2.0.jar:/usr/lib/spark/jars/netty-3.9.9.Final.jar:/usr/lib/spark/jars/commons-lang3-3.5.jar:/usr/lib/spark/jars/javassist-3.18.1-GA.jar:/usr/lib/spark/jars/spark-yarn_2.11-2.2.1.jar:/usr/lib/spark/jars/curator-client-2.6.0.jar:/usr/lib/spark/jars/jcl-over-slf4j-1.7.16.jar:/usr/lib/spark/jars/JavaEWAH-0.3.2.jar:/usr/lib/spark/jars/spark-core_2.11-2.2.1.jar:/usr/lib/spark/jars/metrics-jvm-3.1.2.jar:/usr/lib/spark/jars/metrics-graphite-3.1.2.jar:/usr/lib/spark/jars/chill-java-0.8.0.jar:/usr/lib/spark/jars/commons-configuration-1.6.jar:/usr/lib/spark/jars/jersey-container-servlet-2.22.2.jar:/usr/lib/spark/jars/commons-codec-1.10.jar:/usr/lib/spark/jars/commons-compiler-3.0.0.jar:/usr/lib/spark/jars/jackson-module-scala_2.11-2.6.7.1.jar:/usr/lib/spark/jars/antlr4-runtime-4.5.3.jar:/usr/lib/spark/jars/base64-2.3.8.jar:/usr/lib/spark/jars/javax.annotation-api-1.2.jar:/usr/lib/spark/jars/nimbus-jose-jwt-3.9.jar:/usr/lib/spark/jars/gmetric4j-1.0.7.jar:/usr/lib/spark/jars/commons-digester-1.8.jar:/usr/lib/spark/jars/parquet-encoding-1.8.2.jar:/usr/lib/spark/jars/breeze-macros_2.11-0.13.2.jar:/usr/lib/spark/jars/javolution-5.5.1.jar:/usr/lib/spark/jars/avro-ipc-1.7.7.jar:/usr/lib/spark/jars/spire_2.11-0.13.0.jar:/usr/lib/spark/jars/jackson-core-2.6.7.jar:/usr/lib/spark/jars/hk2-utils-2.4.0-b34.jar:/usr/lib/spark/jars/leveldbjni-all-1.8.jar:/usr/lib/spark/jars/hk2-api-2.4.0-b34.jar:/usr/lib/spark/jars/snappy-java-1.1.2.6.jar:/usr/lib/spark/jars/jersey-container-servlet-core-2.22.2.jar:/usr/lib/spark/jars/curator-
framework-2.6.0.jar:/usr/lib/spark/jars/zookeeper-3.4.6.jar:/usr/lib/spark/jars/mx4j-3.0.2.jar:/usr/lib/spark/jars/paranamer-2.6.jar:/usr/lib/spark/jars/scala-xml_2.11-1.0.2.jar:/usr/lib/spark/jars/guice-servlet-3.0.jar:/usr/lib/spark/jars/osgi-resource-locator-1.0.1.jar:/usr/lib/spark/jars/chill_2.11-0.8.0.jar:/usr/lib/spark/jars/antlr-2.7.7.jar:/usr/lib/spark/jars/okio-1.4.0.jar:/usr/lib/spark/jars/commons-math3-3.4.1.jar:/usr/lib/spark/jars/bonecp-0.8.0.RELEASE.jar:/usr/lib/spark/jars/jackson-annotations-2.6.7.jar:/usr/lib/spark/jars/hadoop-hdfs-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/jackson-dataformat-cbor-2.6.7.jar:/usr/lib/spark/jars/scala-reflect-2.11.8.jar:/usr/lib/spark/jars/jersey-guava-2.22.2.jar:/usr/lib/spark/jars/java-xmlbuilder-1.0.jar:/usr/lib/spark/jars/calcite-core-1.2.0-incubating.jar:/usr/lib/spark/jars/hive-exec-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/scalap-2.11.8.jar:/usr/lib/spark/jars/parquet-jackson-1.8.2.jar:/usr/lib/spark/jars/javax.inject-2.4.0-b34.jar:/usr/lib/spark/jars/kryo-shaded-3.0.3.jar:/usr/lib/spark/jars/parquet-hadoop-bundle-1.6.0.jar:/usr/lib/spark/jars/hadoop-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-httpclient-3.1.jar:/usr/lib/spark/jars/netty-all-4.0.43.Final.jar:/usr/lib/spark/jars/validation-api-1.1.0.Final.jar:/usr/lib/spark/jars/hadoop-auth-2.8.3-amzn-0.jar:/usr/lib/spark/jars/hk2-locator-2.4.0-b34.jar:/usr/lib/spark/jars/spark-ganglia-lgpl_2.11-2.2.1.jar:/usr/lib/spark/jars/arpack_combined_all-0.1.jar:/usr/lib/spark/jars/json4s-core_2.11-3.2.11.jar:/usr/lib/spark/jars/jpam-1.1.jar:/usr/lib/spark/jars/curator-recipes-2.6.0.jar:/usr/lib/spark/jars/jsp-api-2.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-app-2.8.3-amzn-0.jar:/usr/lib/spark/jars/okhttp-2.4.0.jar:/usr/lib/spark/jars/metrics-json-3.1.2.jar:/usr/lib/spark/jars/commons-compress-1.4.1.jar:/usr/lib/spark/jars/httpcore-4.4.4.jar:/usr/lib/spark/jars/commons-beanutils-1.7.0.jar:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/j
ars/hadoop-yarn-server-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/breeze_2.11-0.13.2.jar:/usr/lib/spark/jars/spark-graphx_2.11-2.2.1.jar:/usr/lib/spark/jars/httpclient-4.5.3.jar:/usr/lib/spark/jars/jetty-sslengine-6.1.26-emr.jar:/usr/lib/spark/jars/commons-io-2.4.jar:/usr/lib/spark/jars/parquet-column-1.8.2.jar:/usr/lib/spark/jars/jets3t-0.9.3.jar:/usr/lib/spark/jars/stax-api-1.0-2.jar:/usr/lib/spark/jars/aopalliance-repackaged-2.4.0-b34.jar:/usr/lib/spark/jars/xbean-asm5-shaded-4.4.jar:/usr/lib/spark/jars/hadoop-yarn-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/hive-beeline-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/jtransforms-2.4.0.jar:/usr/lib/spark/jars/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-core-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-net-2.2.jar:/usr/lib/spark/jars/stringtemplate-3.2.1.jar:/usr/lib/spark/jars/apacheds-i18n-2.0.0-M15.jar:/usr/lib/spark/jars/jackson-xc-1.9.13.jar:/usr/lib/spark/jars/parquet-format-2.3.1.jar:/usr/lib/spark/jars/hive-jdbc-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/protobuf-java-2.5.0.jar:/usr/lib/spark/jars/log4j-1.2.17.jar:/usr/lib/spark/jars/hadoop-annotations-2.8.3-amzn-0.jar:/usr/lib/spark/jars/derby-10.12.1.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-shuffle-2.8.3-amzn-0.jar:/usr/lib/spark/jars/bcprov-jdk15on-1.51.jar:/usr/lib/spark/jars/jackson-databind-2.6.7.1.jar:/usr/lib/spark/jars/jline-2.12.1.jar:/usr/lib/spark/jars/stream-2.7.0.jar:/usr/lib/spark/jars/spark-tags_2.11-2.2.1.jar:/usr/lib/spark/jars/jersey-server-2.22.2.jar:/usr/lib/spark/jars/spark-hive_2.11-2.2.1.jar:/usr/lib/spark/jars/scala-parser-combinators_2.11-1.0.4.jar:/usr/lib/spark/jars/commons-pool-1.5.4.jar:/usr/lib/spark/jars/commons-cli-1.2.jar:/usr/lib/spark/jars/jta-1.1.jar:/usr/lib/spark/jars/snappy-0.2.jar:/usr/lib/spark/jars/commons-beanutils-core-1.8.0.jar:/usr/lib/spark/jars/hive-metastore-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar:/usr/lib/spark/jars/slf4j-log4j12-
1.7.16.jar:/usr/lib/spark/jars/jaxb-api-2.2.2.jar:/usr/lib/spark/jars/jackson-jaxrs-1.9.13.jar:/usr/lib/spark/jars/javax.inject-1.jar:/usr/lib/spark/jars/api-util-1.0.0-M20.jar:/usr/lib/spark/jars/javax.servlet-api-3.1.0.jar:/usr/lib/spark/jars/jsr305-1.3.9.jar:/usr/lib/spark/jars/pmml-schema-1.2.15.jar:/usr/lib/spark/jars/spark-streaming_2.11-2.2.1.jar:/usr/lib/spark/jars/libfb303-0.9.3.jar:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/jars/jcip-annotations-1.0.jar:/usr/lib/spark/jars/spire-macros_2.11-0.13.0.jar:/usr/lib/spark/jars/spark-mllib_2.11-2.2.1.jar:/usr/lib/spark/jars/joda-time-2.9.3.jar:/usr/lib/spark/jars/macro-compat_2.11-1.1.1.jar:/usr/lib/spark/jars/hadoop-yarn-api-2.8.3-amzn-0.jar:/usr/lib/spark/jars/spark-mllib-local_2.11-2.2.1.jar:/usr/lib/spark/jars/aopalliance-1.0.jar:/usr/lib/spark/jars/opencsv-2.3.jar:/usr/lib/spark/jars/calcite-avatica-1.2.0-incubating.jar:/usr/lib/spark/jars/spark-catalyst_2.11-2.2.1.jar:/usr/lib/spark/jars/spark-launcher_2.11-2.2.1.jar:/usr/lib/spark/jars/pmml-model-1.2.15.jar:/usr/lib/spark/jars/commons-crypto-1.0.0.jar:/usr/lib/spark/jars/apache-log4j-extras-1.2.17.jar:/usr/lib/spark/jars/hadoop-yarn-server-web-proxy-2.8.3-amzn-0.jar:/usr/lib/spark/jars/minlog-1.3.0.jar:/etc/hadoop/conf/'

All of the code I've posted has been tested in a Jupyter notebook that was configured by @jpolchlo to run rasterframes and geopyspark. We use the notebook to try out new code, and we use Terraform to provision the cluster and run the spark-submit command, which contains this argument: --packages "io.astraea:pyrasterframes:0.7.3-GT2,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.logging.log4j:log4j-core:2.11.1". Could this argument in the spark-submit be relevant, and if so, how would I ensure it gets accounted for in the notebook environment when I set up the conf object? Let me know what info about the training and scoring environments you think might be helpful and I can provide it. Thanks for your help!

metasim commented 5 years ago

@rbavery TBH I'm not up on how GeoPySpark is set up and configured. For RasterFrames to work, you'll need the rasterframes_2.11 and rasterframes-datasource_2.11 jars in the classpath too. So from here it looks like those libraries aren't being deployed correctly. Maybe ping @jpolchlo on the RasterFrames gitter channel?

jpolchlo commented 5 years ago

@metasim I set up the environment for @rbavery by manually building a development version of RF directly in the cluster setup. This was accomplished by a call to sbt pyrasterframes/spPublishLocal. The resulting jar is loaded into the spark environment using a --packages argument. We use the custom version mentioned above, so it definitely isn't pulling from some remote maven, and it is loading correctly and is usable. The trouble only occurs when spark ML models are being loaded. I suspect that the functionality in rasterframes-datasource isn't getting loaded in, but it's puzzling that the model saves just fine this way. Any idea what class isn't getting loaded to make that model import fail?
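One way to answer that question is to inspect the saved model on disk: Spark ML persistence writes each stage of a `PipelineModel` under `stages/<n>_<uid>/metadata/part-00000` as a JSON line whose "class" field names the JVM class the loader will try to map to a Python type. A sketch of that check, using illustrative metadata lines rather than the contents of the attached model_pipeline.zip:

```python
# Sketch: find pipeline stages whose JVM class lies outside
# org.apache.spark, i.e. stages that have no pyspark counterpart unless a
# matching Python package is installed on the driver.
import json

# Illustrative metadata lines (class names are examples, not read from
# the attached model):
stage_metadata = [
    '{"class": "astraea.spark.rasterframes.ml.TileExploder", "uid": "TileExploder_abc"}',
    '{"class": "org.apache.spark.ml.classification.RandomForestClassificationModel", "uid": "rfc_def"}',
]

def non_spark_stages(metadata_lines):
    """Return JVM stage classes that won't map into the pyspark namespace."""
    classes = [json.loads(line)["class"] for line in metadata_lines]
    return [c for c in classes if not c.startswith("org.apache.spark")]

print(non_spark_stages(stage_metadata))
# → ['astraea.spark.rasterframes.ml.TileExploder']
```

Whichever classes this turns up are the ones whose package (here, `astraea`) must be importable in Python for `PipelineModel.load` to succeed.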

rbavery commented 5 years ago

When I added io.astraea:rasterframes-datasource_2.11:0.7.0-RC4 to

    # Install GeoPySpark kernel
    cat <<EOF > /tmp/kernel.json
{
    "language": "python",
    "display_name": "GeoPySpark",
    "argv": [
        "/usr/bin/python3.4",
        "-m",
        "ipykernel",
        "-f",
        "{connection_file}"
    ],
    "env": {
        "PYSPARK_PYTHON": "/usr/bin/python3.4",
        "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3.4",
        "SPARK_HOME": "/usr/lib/spark",
        "PYTHONPATH": "/usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip",
        "GEOPYSPARK_JARS_PATH": "/opt/jars",
        "YARN_CONF_DIR": "/etc/hadoop/conf",
        "PYSPARK_SUBMIT_ARGS": "--conf hadoop.yarn.timeline-service.enabled=false --packages io.astraea:pyrasterframes_2.11:${RASTERFRAMES_VERSION},io.astraea:rasterframes-datasource_2.11:0.7.0-RC4 --repo
sitories https://repo.locationtech.org/content/repositories/sfcurve-releases, pyspark-shell"

I still get

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-2113b4b77176> in <module>()
      1 from pyspark.ml import PipelineModel
----> 2 model_load = PipelineModel.load("s3://activemapper/test.model")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(cls, path)
    255     def load(cls, path):
    256         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 257         return cls.read().load(path)
    258 
    259 

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(self, path)
    199             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
    200                                       % self._clazz)
--> 201         return self._clazz._from_java(java_obj)
    202 
    203     def context(self, sqlContext):

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in <listcomp>(.0)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in _from_java(java_stage)
    200         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
    201         # Generate a default new instance from the stage_name class.
--> 202         py_type = __get_class(stage_name)
    203         if issubclass(py_type, JavaParams):
    204             # Load information from java_stage to the instance.

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in __get_class(clazz)
    194             parts = clazz.split('.')
    195             module = ".".join(parts[:-1])
--> 196             m = __import__(module)
    197             for comp in parts[1:]:
    198                 m = getattr(m, comp)

ImportError: No module named 'astraea'

Thanks for your help @jpolchlo and @metasim !

metasim commented 5 years ago

@jpolchlo Thanks for your help!

@rbavery ~Two~ Three things come to mind:

  1. Consider trying the most recent version of rasterframes, which is under the org.locationtech.rasterframes group ID: org.locationtech.rasterframes:rasterframes_2.11:0.7.1
  2. From what you have above, what happens when you add the rasterframes_2.11 artifact to the --packages list (you only have pyrasterframes and rasterframes-datasource)? I would have thought the transitive dependency loading would have picked it up, but it can't hurt to be more explicit.
  3. If none of that works, I'll see if one of my Python-oriented colleagues can help me figure out what's going on. I'm very puzzled by it, especially considering that training and saving the model actually work.

jpolchlo commented 5 years ago

Ran the above test. This was on EMR with Spark and Hive loaded. Started pyspark with the following command:

pyspark --master yarn --packages org.locationtech.rasterframes:pyrasterframes_2.11:0.7.1,org.locationtech.rasterframes:rasterframes_2.11:0.7.1,org.locationtech.rasterframes:rasterframes-datasource_2.11:0.7.1,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.logging.log4j:log4j-core:2.11.1

Ran the two statements

from pyspark.ml import PipelineModel
model_load = PipelineModel.load("s3://<bucket>/test.model")

Encountered the same error:

Traceback (most recent call last):                                              
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/ml/util.py", line 257, in load
    return cls.read().load(path)
  File "/usr/lib/spark/python/pyspark/ml/util.py", line 201, in load
    return self._clazz._from_java(java_obj)
  File "/usr/lib/spark/python/pyspark/ml/pipeline.py", line 232, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/lib/spark/python/pyspark/ml/pipeline.py", line 232, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 202, in _from_java
    py_type = __get_class(stage_name)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 196, in __get_class
    m = __import__(module)
ImportError: No module named 'astraea'

This is especially frustrating in that if pyspark is replaced with spark-shell, then import astraea.spark.rasterframes.ml.NoDataFilter works without complaint. Clearly the jar artifacts contain the necessary classes. In fact, PipelineModel.load(...) works just fine on the Scala side. Attempting spark._jvm.astraea.spark.rasterframes.ml.NoDataFilter.load(...) from pyspark produces the same result as on the Scala side, so the artifacts are available via the py4j gateway.

I'm getting to be at a loss for solutions here. The environment seems to be properly configured, but for some reason it's inaccessible from Python in whatever way Spark needs it to be.

Thoughts, @metasim?

metasim commented 5 years ago

@jpolchlo @rbavery I got a little further with these scripts (but not total success):

https://gist.github.com/metasim/c4d1225124fcf53cac2d04330ee3f411

 File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/shell.py", line 88, in <module>
    exec(code)
  File "read-model.py", line 13, in <module>
    model_load = PipelineModel.load("test_model")
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/util.py", line 257, in load
    return cls.read().load(path)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/util.py", line 201, in load
    return self._clazz._from_java(java_obj)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/pipeline.py", line 232, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/pipeline.py", line 232, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/wrapper.py", line 202, in _from_java
    py_type = __get_class(stage_name)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/wrapper.py", line 198, in __get_class
    m = getattr(m, comp)
AttributeError: module 'astraea.spark.rasterframes.ml' has no attribute 'NoDataFilter'

However, you can run this with success:

>>> dir(spark._jvm.astraea.spark.rasterframes.ml.NoDataFilter)
['load', 'read']

which says to me that the classes are on the classpath.

I'm not sure what else to do at this point, other than find others who have written ML transformers in Scala and been able to save/restore them from Python. I'm wondering whether you have to have mirror classes in Python for each Scala class, or whether there is some additional metadata required to instruct Python how to reconstitute the Pipeline.

I'm sorry this has been a long road to what might be a dead end. If the goal is just to score a dataset, perhaps doing it in Scala isn't such a heavy lift.

cc: @bguseman

jpolchlo commented 5 years ago

It appears that for Spark 2.2.1, the PipelineModel loader relies on there being a Python wrapper class in a Python module with the same name as the Java class listed in the model metadata (in this case, the Python class astraea.spark.rasterframes.ml.NoDataFilter must exist). The default Python readers in Spark 2.3.2 may be smart enough to construct an appropriate wrapper class for the underlying RF Java pipeline object, but this remains untested.
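As a minimal sketch of what that requirement means in practice: the `register_stub` helper below is hypothetical and uses a plain class as a stand-in; a real fix would ship a proper `pyspark.ml.wrapper.JavaTransformer` subclass at this dotted path inside the pyrasterframes Python package. It only illustrates that once a Python attribute path matching the Java class name is importable, the import and attribute lookup that failed above succeed:

```python
import sys
import types

# Hypothetical helper: register an importable Python module path mirroring the
# Java class name, with a plain class standing in for the real wrapper.
def register_stub(dotted_name):
    parts = dotted_name.split('.')
    for i in range(1, len(parts)):          # astraea, astraea.spark, ...
        pkg = ".".join(parts[:i])
        if pkg not in sys.modules:
            sys.modules[pkg] = types.ModuleType(pkg)
        if i > 1:                           # attach each level to its parent package
            setattr(sys.modules[".".join(parts[:i - 1])], parts[i - 1], sys.modules[pkg])
    cls = type(parts[-1], (), {})           # stand-in for the real wrapper class
    setattr(sys.modules[".".join(parts[:-1])], parts[-1], cls)
    return cls

register_stub("astraea.spark.rasterframes.ml.NoDataFilter")

# The same import/getattr sequence from pyspark's __get_class now succeeds:
m = __import__("astraea.spark.rasterframes.ml")
for comp in ["spark", "rasterframes", "ml", "NoDataFilter"]:
    m = getattr(m, comp)
print(m.__name__)  # NoDataFilter
```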

I believe we forked off a version of RF before spark 2.3 compatibility was established, and then built new functionality on top of that. Looks like the solution might be to rebase the changes we made onto a newer version of RF and bump up the EMR version to one that has spark 2.3 or newer.

metasim commented 5 years ago

@jpolchlo Thanks for the update.... super helpful! Is there some functionality from your fork you want rolled in?

jpolchlo commented 5 years ago

We have some very rough code in our fork that establishes some pseudo-focal operations (via overbuffered tiles). See https://github.com/geotrellis/rasterframes/tree/feature/focal-operations. This is a model for how to provide the functionality, but not an implementation I'd really stand behind. Real GT BufferedTiles should be promoted for that purpose, but that interface in GT isn't what I'd characterize as "battle-hardened". Might be a good opportunity to get some work into geotrellis/geotrellis-contrib to make sure that proper buffering of tiles is built into the new RasterSource interface.

metasim commented 5 years ago

@jpolchlo We definitely need buffered tiles in RF, so maybe that will be a good topic of discussion for when I visit!