Open rbavery opened 6 years ago
@rbavery Sorry for the delay in responding to this. Can you confirm that, before you tried to load the model back in, you called the `withRasterFrames()` method on the `SparkSession`? My initial guess is that it's behaving as if the JVM library isn't loaded.
Are you willing to share the JSON component of the saved workflow?
Hi @metasim, thanks for the reply. I definitely called `.withRasterFrames()`; here is what that section looks like:
```python
conf = gps \
    .geopyspark_conf(appName="CVML Model Iteration", master="yarn") \
    .set("spark.dynamicAllocation.enabled", True) \
    .set("spark.ui.enabled", True) \
    .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate() \
    .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")
```
Attached is the saved workflow model_pipeline.zip
@rbavery Yeah, the saved workflow looks fine. Have you tried loading it in `local` mode? I'm wondering if it's a YARN deployment thing... the requisite jar and/or zip not getting delivered properly.
I'm inclined to ask @jbouffard on Gitter if he has seen this error before, as he's the person who seems to have done the most with pyrasterframes outside of my organization (where we may have inadvertently made assumptions about the runtime environment).
@metasim I haven't tried local mode. Would I set it like this, keeping everything else constant?
```python
conf = gps \
    .geopyspark_conf(appName="CVML Model Iteration", master="local") \
    .set("spark.dynamicAllocation.enabled", True) \
    .set("spark.ui.enabled", True) \
    .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate() \
    .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")
```
Thanks for the help, I appreciate it.
@metasim Heads up: I got a slightly different error when trying the above, just switching `master="yarn"` to `master="local"`:
```
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-2113b4b77176> in <module>()
      1 from pyspark.ml import PipelineModel
----> 2 model_load = PipelineModel.load("s3://activemapper/test.model")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(cls, path)
    255     def load(cls, path):
    256         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 257         return cls.read().load(path)
    258
    259

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(self, path)
    199             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
    200                                       % self._clazz)
--> 201         return self._clazz._from_java(java_obj)
    202
    203     def context(self, sqlContext):

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in <listcomp>(.0)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in _from_java(java_stage)
    200         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
    201         # Generate a default new instance from the stage_name class.
--> 202         py_type = __get_class(stage_name)
    203         if issubclass(py_type, JavaParams):
    204             # Load information from java_stage to the instance.

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in __get_class(clazz)
    194         parts = clazz.split('.')
    195         module = ".".join(parts[:-1])
--> 196         m = __import__(module)
    197         for comp in parts[1:]:
    198             m = getattr(m, comp)

ImportError: No module named 'astraea.spark'
```
@rbavery I'm sorry this continues to be a problem. Would you be able to run the following before you try to load the model, and confirm that there's a rasterframes jar file in the classpath?

```python
spark._jvm.java.lang.System.getProperty('java.class.path')
```
If not, I'd need to know more about the differences between your training and scoring environments to determine why RasterFrames was in the classpath in the former case but not the latter.
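Since that property comes back as one long colon-delimited string, a small helper makes the check easier to read. This is just a sketch; `find_jars` is a hypothetical name, and the live-session lines are commented out because they assume an initialized `spark` built with `.withRasterFrames()`.

```python
def find_jars(classpath, needle="rasterframes"):
    """Return the classpath entries whose path contains `needle`."""
    return [entry for entry in classpath.split(":") if needle in entry.lower()]

# In a live session:
# cp = spark._jvm.java.lang.System.getProperty("java.class.path")
# find_jars(cp)  # an empty list means the jar never reached the driver JVM
```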
Looks like there is not. I ran the following:

```python
probability_images = 3  # Write this many probability images to location specified in YAML file
seed = 42  # Use this seed for sampling probability images
complete_catalog = False  # Produce the entire catalog of probability images
s3_bucket = 'activemapper'
run_id = 1
aoi_name = 'ghana-aoi4-test'
start_time = time.time()
conf = gps \
    .geopyspark_conf(appName="CVML Model Iteration", master="local") \
    .set("spark.dynamicAllocation.enabled", True) \
    .set("spark.ui.enabled", True) \
    .set("spark.hadoop.yarn.timeline-service.enabled", False)
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate() \
    .withRasterFrames()
logger = spark._jvm.org.apache.log4j.Logger.getLogger("Main")
spark._jvm.java.lang.System.getProperty('java.class.path')
```
and got
'/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar:/usr/lib/hadoop-lzo/lib/hadoop-lzo-0.4.19.jar:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codedeploy-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ssm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudformation-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-xray-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servicediscovery-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mobile-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-polly-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servermigration-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-devicefarm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudhsmv2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-test-utils-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloud9-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workmail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workspaces-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-emr-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-appstream-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-machinelearning-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-organizations-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-rekognition-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-discovery-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-api-gateway-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-appsync-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lightsail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudtrail-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kinesis-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-acm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codestar-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticache-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-route53-1.11.267.jar:/usr/share/aws/aws-java-sdk/
aws-java-sdk-elasticloadbalancingv2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-support-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudsearch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudhsm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-directconnect-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codecommit-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iot-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-rds-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-storagegateway-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kinesisvideo-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-redshift-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iotjobsdataplane-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mq-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lambda-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-s3-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-shield-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-resourcegroups-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ecr-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-costandusagereport-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-efs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codegen-maven-plugin-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-budgets-1.11.267.jar:/usr/share/aws/aws-java-sdk/jmespath-java-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-pricing-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitoidentity-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-models-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-glacier-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-serverlessapplicationrepository-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-gamelift-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-directory-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-batch-1.11.267.jar:/usr/share/aws/aws-java-sdk
/aws-java-sdk-lex-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elastictranscoder-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediastore-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-health-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-comprehend-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-iam-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-autoscaling-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ses-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-guardduty-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitoidp-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticsearch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-simpledb-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-config-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dynamodb-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sagemakerruntime-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplaceentitlement-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-waf-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-stepfunctions-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-greengrass-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-importexport-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-kms-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-core-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-resourcegroupstaggingapi-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-athena-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-datapipeline-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sqs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-migrationhub-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudwatchmetrics-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sts-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplacecommerceanalytics-1.11.267.jar:/usr/share/aw
s/aws-java-sdk/aws-java-sdk-sagemaker-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-costexplorer-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mechanicalturkrequester-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediaconvert-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dms-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-snowball-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-dax-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ec2-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opsworks-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediastoredata-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-code-generator-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-logs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-lexmodelbuilding-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticbeanstalk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-alexaforbusiness-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-workdocs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opsworkscm-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-simpleworkflow-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudwatch-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-marketplacemeteringservice-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-translate-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-clouddirectory-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-events-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-mediapackage-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-autoscalingplans-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codebuild-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-inspector-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-sns-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-servicecatalog-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-pi
npoint-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudfront-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-opensdk-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-applicationautoscaling-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-ecs-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-medialive-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cognitosync-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-codepipeline-1.11.267.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-elasticloadbalancing-1.11.267.jar:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/jcl-over-slf4j-1.7.21.jar:/usr/share/aws/emr/emrfs/lib/bcpkix-jdk15on-1.51.jar:/usr/share/aws/emr/emrfs/lib/jmespath-java-1.11.267.jar:/usr/share/aws/emr/emrfs/lib/ion-java-1.0.2.jar:/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.21.0.jar:/usr/share/aws/emr/emrfs/lib/bcprov-jdk15on-1.51.jar:/usr/share/aws/emr/emrfs/lib/javax.inject-1.jar:/usr/share/aws/emr/emrfs/lib/slf4j-api-1.7.21.jar:/usr/share/aws/emr/emrfs/lib/aopalliance-1.0.jar:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/lib/spark/conf/:/usr/lib/spark/jars/avro-1.7.7.jar:/usr/lib/spark/jars/htrace-core4-4.0.1-incubating.jar:/usr/lib/spark/jars/guice-3.0.jar:/usr/lib/spark/jars/jul-to-slf4j-1.7.16.jar:/usr/lib/spark/jars/hive-cli-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/spark-network-common_2.11-2.2.1.jar:/usr/lib/spark/jars/scala-library-2.11.8.jar:/usr/lib/spark/jars/spark-repl_2.11-2.2.1.jar:/usr/lib/spark/jars/gson-2.2.4.jar:/usr/lib/spark/jars/janino-3.0.0.jar:/usr/lib/spark/jars/commons-logging-1.1.3.jar:/usr/lib/spark/jars/hadoop-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-lang-2.6.jar:/usr/lib/spark/jars/RoaringBitmap-0.5.11.jar:/usr/lib/spark/
jars/ivy-2.4.0.jar:/usr/lib/spark/jars/machinist_2.11-0.6.1.jar:/usr/lib/spark/jars/libthrift-0.9.3.jar:/usr/lib/spark/jars/javax.ws.rs-api-2.0.1.jar:/usr/lib/spark/jars/metrics-core-3.1.2.jar:/usr/lib/spark/jars/pyrolite-4.13.jar:/usr/lib/spark/jars/jetty-6.1.26-emr.jar:/usr/lib/spark/jars/spark-network-shuffle_2.11-2.2.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/api-asn1-api-1.0.0-M20.jar:/usr/lib/spark/jars/jersey-common-2.22.2.jar:/usr/lib/spark/jars/py4j-0.10.4.jar:/usr/lib/spark/jars/activation-1.1.1.jar:/usr/lib/spark/jars/mail-1.4.7.jar:/usr/lib/spark/jars/jackson-module-paranamer-2.6.7.jar:/usr/lib/spark/jars/json-smart-1.1.1.jar:/usr/lib/spark/jars/xz-1.0.jar:/usr/lib/spark/jars/spark-sketch_2.11-2.2.1.jar:/usr/lib/spark/jars/slf4j-api-1.7.16.jar:/usr/lib/spark/jars/guava-14.0.1.jar:/usr/lib/spark/jars/json4s-jackson_2.11-3.2.11.jar:/usr/lib/spark/jars/univocity-parsers-2.2.1.jar:/usr/lib/spark/jars/core-1.1.2.jar:/usr/lib/spark/jars/jetty-util-6.1.26-emr.jar:/usr/lib/spark/jars/antlr-runtime-3.4.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-jobclient-2.8.3-amzn-0.jar:/usr/lib/spark/jars/jersey-media-jaxb-2.22.2.jar:/usr/lib/spark/jars/hadoop-yarn-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/objenesis-2.1.jar:/usr/lib/spark/jars/commons-dbcp-1.4.jar:/usr/lib/spark/jars/jodd-core-3.5.2.jar:/usr/lib/spark/jars/spark-unsafe_2.11-2.2.1.jar:/usr/lib/spark/jars/lz4-1.3.0.jar:/usr/lib/spark/jars/spark-hive-thriftserver_2.11-2.2.1.jar:/usr/lib/spark/jars/calcite-linq4j-1.2.0-incubating.jar:/usr/lib/spark/jars/json4s-ast_2.11-3.2.11.jar:/usr/lib/spark/jars/jackson-core-asl-1.9.13.jar:/usr/lib/spark/jars/spark-sql_2.11-2.2.1.jar:/usr/lib/spark/jars/parquet-common-1.8.2.jar:/usr/lib/spark/jars/jersey-client-2.22.2.jar:/usr/lib/spark/jars/ST4-4.0.4.jar:/usr/lib/spark/jars/oro-2.0.8.jar:/usr/lib/spark/jars/jdo-api-3.0.1.jar:/usr/lib/spark/jars/parquet-hadoop-1.8.2.jar:/usr/lib/spark/jars/scala-compiler-2.11.8.jar:/usr/
lib/spark/jars/jackson-mapper-asl-1.9.13.jar:/usr/lib/spark/jars/compress-lzf-1.0.3.jar:/usr/lib/spark/jars/metrics-ganglia-3.1.2.jar:/usr/lib/spark/jars/stax-api-1.0.1.jar:/usr/lib/spark/jars/xmlenc-0.52.jar:/usr/lib/spark/jars/avro-mapred-1.7.7-hadoop2.jar:/usr/lib/spark/jars/oncrpc-1.0.7.jar:/usr/lib/spark/jars/shapeless_2.11-2.3.2.jar:/usr/lib/spark/jars/eigenbase-properties-1.1.5.jar:/usr/lib/spark/jars/commons-collections-3.2.2.jar:/usr/lib/spark/jars/super-csv-2.2.0.jar:/usr/lib/spark/jars/netty-3.9.9.Final.jar:/usr/lib/spark/jars/commons-lang3-3.5.jar:/usr/lib/spark/jars/javassist-3.18.1-GA.jar:/usr/lib/spark/jars/spark-yarn_2.11-2.2.1.jar:/usr/lib/spark/jars/curator-client-2.6.0.jar:/usr/lib/spark/jars/jcl-over-slf4j-1.7.16.jar:/usr/lib/spark/jars/JavaEWAH-0.3.2.jar:/usr/lib/spark/jars/spark-core_2.11-2.2.1.jar:/usr/lib/spark/jars/metrics-jvm-3.1.2.jar:/usr/lib/spark/jars/metrics-graphite-3.1.2.jar:/usr/lib/spark/jars/chill-java-0.8.0.jar:/usr/lib/spark/jars/commons-configuration-1.6.jar:/usr/lib/spark/jars/jersey-container-servlet-2.22.2.jar:/usr/lib/spark/jars/commons-codec-1.10.jar:/usr/lib/spark/jars/commons-compiler-3.0.0.jar:/usr/lib/spark/jars/jackson-module-scala_2.11-2.6.7.1.jar:/usr/lib/spark/jars/antlr4-runtime-4.5.3.jar:/usr/lib/spark/jars/base64-2.3.8.jar:/usr/lib/spark/jars/javax.annotation-api-1.2.jar:/usr/lib/spark/jars/nimbus-jose-jwt-3.9.jar:/usr/lib/spark/jars/gmetric4j-1.0.7.jar:/usr/lib/spark/jars/commons-digester-1.8.jar:/usr/lib/spark/jars/parquet-encoding-1.8.2.jar:/usr/lib/spark/jars/breeze-macros_2.11-0.13.2.jar:/usr/lib/spark/jars/javolution-5.5.1.jar:/usr/lib/spark/jars/avro-ipc-1.7.7.jar:/usr/lib/spark/jars/spire_2.11-0.13.0.jar:/usr/lib/spark/jars/jackson-core-2.6.7.jar:/usr/lib/spark/jars/hk2-utils-2.4.0-b34.jar:/usr/lib/spark/jars/leveldbjni-all-1.8.jar:/usr/lib/spark/jars/hk2-api-2.4.0-b34.jar:/usr/lib/spark/jars/snappy-java-1.1.2.6.jar:/usr/lib/spark/jars/jersey-container-servlet-core-2.22.2.jar:/usr/lib/spark/jars/curator-
framework-2.6.0.jar:/usr/lib/spark/jars/zookeeper-3.4.6.jar:/usr/lib/spark/jars/mx4j-3.0.2.jar:/usr/lib/spark/jars/paranamer-2.6.jar:/usr/lib/spark/jars/scala-xml_2.11-1.0.2.jar:/usr/lib/spark/jars/guice-servlet-3.0.jar:/usr/lib/spark/jars/osgi-resource-locator-1.0.1.jar:/usr/lib/spark/jars/chill_2.11-0.8.0.jar:/usr/lib/spark/jars/antlr-2.7.7.jar:/usr/lib/spark/jars/okio-1.4.0.jar:/usr/lib/spark/jars/commons-math3-3.4.1.jar:/usr/lib/spark/jars/bonecp-0.8.0.RELEASE.jar:/usr/lib/spark/jars/jackson-annotations-2.6.7.jar:/usr/lib/spark/jars/hadoop-hdfs-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/jackson-dataformat-cbor-2.6.7.jar:/usr/lib/spark/jars/scala-reflect-2.11.8.jar:/usr/lib/spark/jars/jersey-guava-2.22.2.jar:/usr/lib/spark/jars/java-xmlbuilder-1.0.jar:/usr/lib/spark/jars/calcite-core-1.2.0-incubating.jar:/usr/lib/spark/jars/hive-exec-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/scalap-2.11.8.jar:/usr/lib/spark/jars/parquet-jackson-1.8.2.jar:/usr/lib/spark/jars/javax.inject-2.4.0-b34.jar:/usr/lib/spark/jars/kryo-shaded-3.0.3.jar:/usr/lib/spark/jars/parquet-hadoop-bundle-1.6.0.jar:/usr/lib/spark/jars/hadoop-client-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-httpclient-3.1.jar:/usr/lib/spark/jars/netty-all-4.0.43.Final.jar:/usr/lib/spark/jars/validation-api-1.1.0.Final.jar:/usr/lib/spark/jars/hadoop-auth-2.8.3-amzn-0.jar:/usr/lib/spark/jars/hk2-locator-2.4.0-b34.jar:/usr/lib/spark/jars/spark-ganglia-lgpl_2.11-2.2.1.jar:/usr/lib/spark/jars/arpack_combined_all-0.1.jar:/usr/lib/spark/jars/json4s-core_2.11-3.2.11.jar:/usr/lib/spark/jars/jpam-1.1.jar:/usr/lib/spark/jars/curator-recipes-2.6.0.jar:/usr/lib/spark/jars/jsp-api-2.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-app-2.8.3-amzn-0.jar:/usr/lib/spark/jars/okhttp-2.4.0.jar:/usr/lib/spark/jars/metrics-json-3.1.2.jar:/usr/lib/spark/jars/commons-compress-1.4.1.jar:/usr/lib/spark/jars/httpcore-4.4.4.jar:/usr/lib/spark/jars/commons-beanutils-1.7.0.jar:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/j
ars/hadoop-yarn-server-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/breeze_2.11-0.13.2.jar:/usr/lib/spark/jars/spark-graphx_2.11-2.2.1.jar:/usr/lib/spark/jars/httpclient-4.5.3.jar:/usr/lib/spark/jars/jetty-sslengine-6.1.26-emr.jar:/usr/lib/spark/jars/commons-io-2.4.jar:/usr/lib/spark/jars/parquet-column-1.8.2.jar:/usr/lib/spark/jars/jets3t-0.9.3.jar:/usr/lib/spark/jars/stax-api-1.0-2.jar:/usr/lib/spark/jars/aopalliance-repackaged-2.4.0-b34.jar:/usr/lib/spark/jars/xbean-asm5-shaded-4.4.jar:/usr/lib/spark/jars/hadoop-yarn-common-2.8.3-amzn-0.jar:/usr/lib/spark/jars/hive-beeline-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/jtransforms-2.4.0.jar:/usr/lib/spark/jars/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-core-2.8.3-amzn-0.jar:/usr/lib/spark/jars/commons-net-2.2.jar:/usr/lib/spark/jars/stringtemplate-3.2.1.jar:/usr/lib/spark/jars/apacheds-i18n-2.0.0-M15.jar:/usr/lib/spark/jars/jackson-xc-1.9.13.jar:/usr/lib/spark/jars/parquet-format-2.3.1.jar:/usr/lib/spark/jars/hive-jdbc-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/protobuf-java-2.5.0.jar:/usr/lib/spark/jars/log4j-1.2.17.jar:/usr/lib/spark/jars/hadoop-annotations-2.8.3-amzn-0.jar:/usr/lib/spark/jars/derby-10.12.1.1.jar:/usr/lib/spark/jars/hadoop-mapreduce-client-shuffle-2.8.3-amzn-0.jar:/usr/lib/spark/jars/bcprov-jdk15on-1.51.jar:/usr/lib/spark/jars/jackson-databind-2.6.7.1.jar:/usr/lib/spark/jars/jline-2.12.1.jar:/usr/lib/spark/jars/stream-2.7.0.jar:/usr/lib/spark/jars/spark-tags_2.11-2.2.1.jar:/usr/lib/spark/jars/jersey-server-2.22.2.jar:/usr/lib/spark/jars/spark-hive_2.11-2.2.1.jar:/usr/lib/spark/jars/scala-parser-combinators_2.11-1.0.4.jar:/usr/lib/spark/jars/commons-pool-1.5.4.jar:/usr/lib/spark/jars/commons-cli-1.2.jar:/usr/lib/spark/jars/jta-1.1.jar:/usr/lib/spark/jars/snappy-0.2.jar:/usr/lib/spark/jars/commons-beanutils-core-1.8.0.jar:/usr/lib/spark/jars/hive-metastore-1.2.1-spark2-amzn-0.jar:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar:/usr/lib/spark/jars/slf4j-log4j12-
1.7.16.jar:/usr/lib/spark/jars/jaxb-api-2.2.2.jar:/usr/lib/spark/jars/jackson-jaxrs-1.9.13.jar:/usr/lib/spark/jars/javax.inject-1.jar:/usr/lib/spark/jars/api-util-1.0.0-M20.jar:/usr/lib/spark/jars/javax.servlet-api-3.1.0.jar:/usr/lib/spark/jars/jsr305-1.3.9.jar:/usr/lib/spark/jars/pmml-schema-1.2.15.jar:/usr/lib/spark/jars/spark-streaming_2.11-2.2.1.jar:/usr/lib/spark/jars/libfb303-0.9.3.jar:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/jars/jcip-annotations-1.0.jar:/usr/lib/spark/jars/spire-macros_2.11-0.13.0.jar:/usr/lib/spark/jars/spark-mllib_2.11-2.2.1.jar:/usr/lib/spark/jars/joda-time-2.9.3.jar:/usr/lib/spark/jars/macro-compat_2.11-1.1.1.jar:/usr/lib/spark/jars/hadoop-yarn-api-2.8.3-amzn-0.jar:/usr/lib/spark/jars/spark-mllib-local_2.11-2.2.1.jar:/usr/lib/spark/jars/aopalliance-1.0.jar:/usr/lib/spark/jars/opencsv-2.3.jar:/usr/lib/spark/jars/calcite-avatica-1.2.0-incubating.jar:/usr/lib/spark/jars/spark-catalyst_2.11-2.2.1.jar:/usr/lib/spark/jars/spark-launcher_2.11-2.2.1.jar:/usr/lib/spark/jars/pmml-model-1.2.15.jar:/usr/lib/spark/jars/commons-crypto-1.0.0.jar:/usr/lib/spark/jars/apache-log4j-extras-1.2.17.jar:/usr/lib/spark/jars/hadoop-yarn-server-web-proxy-2.8.3-amzn-0.jar:/usr/lib/spark/jars/minlog-1.3.0.jar:/etc/hadoop/conf/'
All of the code I've posted has been tested in a Jupyter notebook that @jpolchlo configured to run rasterframes and geopyspark. We use the notebook to try out new code, and we use terraform to provision the cluster and run the spark-submit command, which contains this argument: `--packages io.astraea:pyrasterframes:0.7.3-GT2,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.logging.log4j:log4j-core:2.11.1`. Could this argument in the spark-submit be relevant, and if so, how would I ensure it gets accounted for in the notebook environment when I set up the conf object? Let me know what info you think might be helpful about the training and scoring environments and I can provide it. Thanks for your help!
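One way to carry those spark-submit flags into the notebook session — a sketch only, not verified against this cluster — is to set `PYSPARK_SUBMIT_ARGS` before the first `SparkSession` is built, since the variable is read when the JVM gateway launches and is silently ignored afterwards:

```python
import os

# Mirror the spark-submit flags in the notebook; the coordinates below are
# copied from the terraform-provisioned spark-submit command. This must run
# before SparkSession.builder...getOrCreate() starts the gateway.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.astraea:pyrasterframes:0.7.3-GT2,"
    "org.apache.hadoop:hadoop-aws:2.7.3,"
    "org.apache.logging.log4j:log4j-core:2.11.1 "
    "pyspark-shell"
)
```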
@rbavery TBH I'm not up on how GeoPySpark is set up and configured. For RasterFrames to work, you'll need the `rasterframes_2.11` and `rasterframes-datasource_2.11` jars in the classpath too. So from here it looks like those libraries aren't being deployed correctly. Maybe ping @jpolchlo on the RasterFrames gitter channel?
@metasim I set up the environment for @rbavery by manually building a development version of RF directly in the cluster setup. This was accomplished by a call to `sbt pyrasterframes/spPublishLocal`. The resulting jar is loaded into the Spark environment using a `--packages` argument. We use the custom version mentioned above, so it definitely isn't pulling from some remote Maven repository, and it is loading correctly and is usable. The trouble only occurs when Spark ML models are being loaded. I suspect that the functionality in `rasterframes-datasource` isn't getting loaded, but it's puzzling that the model saves just fine this way. Any idea what class isn't getting loaded that makes the model import fail?
When I added `io.astraea:rasterframes-datasource_2.11:0.7.0-RC4` to

```sh
# Install GeoPySpark kernel
cat <<EOF > /tmp/kernel.json
{
  "language": "python",
  "display_name": "GeoPySpark",
  "argv": [
    "/usr/bin/python3.4",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python3.4",
    "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3.4",
    "SPARK_HOME": "/usr/lib/spark",
    "PYTHONPATH": "/usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip",
    "GEOPYSPARK_JARS_PATH": "/opt/jars",
    "YARN_CONF_DIR": "/etc/hadoop/conf",
    "PYSPARK_SUBMIT_ARGS": "--conf hadoop.yarn.timeline-service.enabled=false --packages io.astraea:pyrasterframes_2.11:${RASTERFRAMES_VERSION},io.astraea:rasterframes-datasource_2.11:0.7.0-RC4 --repositories https://repo.locationtech.org/content/repositories/sfcurve-releases pyspark-shell"
  }
}
EOF
```
I still get:

```
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-2113b4b77176> in <module>()
      1 from pyspark.ml import PipelineModel
----> 2 model_load = PipelineModel.load("s3://activemapper/test.model")

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(cls, path)
    255     def load(cls, path):
    256         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 257         return cls.read().load(path)
    258
    259

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py in load(self, path)
    199             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
    200                                       % self._clazz)
--> 201         return self._clazz._from_java(java_obj)
    202
    203     def context(self, sqlContext):

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py in <listcomp>(.0)
    230         """
    231         # Load information from java_stage to the instance.
--> 232         py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
    233         # Create a new instance of this stage.
    234         py_stage = cls(py_stages)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in _from_java(java_stage)
    200         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
    201         # Generate a default new instance from the stage_name class.
--> 202         py_type = __get_class(stage_name)
    203         if issubclass(py_type, JavaParams):
    204             # Load information from java_stage to the instance.

/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py in __get_class(clazz)
    194         parts = clazz.split('.')
    195         module = ".".join(parts[:-1])
--> 196         m = __import__(module)
    197         for comp in parts[1:]:
    198             m = getattr(m, comp)

ImportError: No module named 'astraea'
```
Thanks for your help @jpolchlo and @metasim !
@jpolchlo Thanks for your help!
@rbavery ~~Two~~ Three things come to mind:

- Use the `org.locationtech.rasterframes` group ID: `org.locationtech.rasterframes:rasterframes_2.11:0.7.1`
- Try adding the `rasterframes_2.11` artifact to the `--packages` list (you only have `pyrasterframes` and `rasterframes-datasource`)? I would have thought the transitive dependency loading would have picked it up, but it can't hurt to be more explicit.

Ran the above test. This was on EMR with Spark and Hive loaded. Started `pyspark` with the following command:

```sh
pyspark --master yarn --packages org.locationtech.rasterframes:pyrasterframes_2.11:0.7.1,org.locationtech.rasterframes:rasterframes_2.11:0.7.1,org.locationtech.rasterframes:rasterframes-datasource_2.11:0.7.1,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.logging.log4j:log4j-core:2.11.1
```
Ran the two statements:

```python
from pyspark.ml import PipelineModel
model_load = PipelineModel.load("s3://<bucket>/test.model")
```
Encountered the same error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/ml/util.py", line 257, in load
    return cls.read().load(path)
  File "/usr/lib/spark/python/pyspark/ml/util.py", line 201, in load
    return self._clazz._from_java(java_obj)
  File "/usr/lib/spark/python/pyspark/ml/pipeline.py", line 232, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/lib/spark/python/pyspark/ml/pipeline.py", line 232, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 202, in _from_java
    py_type = __get_class(stage_name)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 196, in __get_class
    m = __import__(module)
ImportError: No module named 'astraea'
```
This is especially frustrating in that if `pyspark` is replaced with `spark-shell`, then `import astraea.spark.rasterframes.ml.NoDataFilter` works without complaint. Clearly the jar artifacts contain the necessary classes. In fact, `PipelineModel.load(...)` works just fine on the Scala side. Attempting `spark._jvm.astraea.spark.rasterframes.ml.NoDataFilter.load(...)` from pyspark produces the same result as on the Scala side, so the artifacts are available via the py4j gateway.
I'm getting to be at a loss for solutions here. The environment seems to be properly configured, but for some reason it's inaccessible from Python in whatever way Spark needs it to be.
Thoughts, @metasim?
@jpolchlo @rbavery I got a little further with these scripts (but not total success):
https://gist.github.com/metasim/c4d1225124fcf53cac2d04330ee3f411
```
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/shell.py", line 88, in <module>
    exec(code)
  File "read-model.py", line 13, in <module>
    model_load = PipelineModel.load("test_model")
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/util.py", line 257, in load
    return cls.read().load(path)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/util.py", line 201, in load
    return self._clazz._from_java(java_obj)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/pipeline.py", line 232, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/pipeline.py", line 232, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/wrapper.py", line 202, in _from_java
    py_type = __get_class(stage_name)
  File "/usr/local/Cellar/apache-spark/2.2.1/libexec/python/pyspark/ml/wrapper.py", line 198, in __get_class
    m = getattr(m, comp)
AttributeError: module 'astraea.spark.rasterframes.ml' has no attribute 'NoDataFilter'
```
However, you can run this with success:

```python
>>> dir(spark._jvm.astraea.spark.rasterframes.ml.NoDataFilter)
['load', 'read']
```
which says to me that the classes are on the classpath.
I'm not sure what else to do at this point, other than find others who have written ML transformers in Scala and been able to save/restore them from Python. I'm wondering if you either have to have mirror classes in Python for each Scala class, or if there is some additional metadata required to instruct Python how to reconstitute the Pipeline.
I'm sorry this has been a long road to what might be a dead end. If the goal is just to score a dataset, perhaps doing it in Scala isn't such a heavy lift.
cc: @bguseman
It appears that for Spark 2.2.1, the `PipelineModel` loader relies on there being a Python wrapper class in a Python module of the same name as the Java class listed in the model metadata (in this case, the Python class `astraea.spark.rasterframes.ml.NoDataFilter` must exist). It appears that the default Python readers in Spark 2.3.2 might be smart enough to construct an appropriate wrapper class for the underlying RF Java pipeline object, but this remains untested.
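The lookup the tracebacks above go through can be sketched as follows (a simplified rendering of the `__get_class` logic visible in the tracebacks; `get_class` is a hypothetical stand-in name). It shows why the Java class name must resolve as a Python import path:

```python
def get_class(clazz):
    """Simplified sketch of the name-to-class lookup in pyspark's ml.wrapper."""
    parts = clazz.split(".")
    module = ".".join(parts[:-1])
    m = __import__(module)  # raises ImportError if no such Python package exists
    for comp in parts[1:]:
        m = getattr(m, comp)
    return m

# Built-in stages resolve because pyspark ships mirror classes under the
# rewritten name:
name = "org.apache.spark.ml.PipelineModel".replace("org.apache.spark", "pyspark")
# name == "pyspark.ml.PipelineModel", which pyspark can import

# A custom Scala stage has no Python-side package to import:
try:
    get_class("astraea.spark.rasterframes.ml.NoDataFilter")
except ImportError:
    pass  # the same ImportError seen in the tracebacks above
```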
I believe we forked off a version of RF before spark 2.3 compatibility was established, and then built new functionality on top of that. Looks like the solution might be to rebase the changes we made onto a newer version of RF and bump up the EMR version to one that has spark 2.3 or newer.
@jpolchlo Thanks for the update.... super helpful! Is there some functionality from your fork you want rolled in?
We have some very rough code in our fork that establishes some pseudo-focal operations (via overbuffered tiles). See https://github.com/geotrellis/rasterframes/tree/feature/focal-operations. This is a model for how to provide the functionality, but not an implementation I'd really stand behind. Real GT `BufferedTile`s should be promoted for that purpose, but that interface in GT isn't what I'd characterize as "battle-hardened". Might be a good opportunity to get some work into geotrellis/geotrellis-contrib to make sure that proper buffering of tiles is built into the new RasterSource interface.
@jpolchlo We definitely need buffered tiles in RF, so maybe that will be a good topic of discussion for when I visit!
I can save a model to S3 but can't load it with the following:
Documentation indicates this is the way to load and use past models: https://stackoverflow.com/questions/50587744/how-to-save-the-model-after-doing-pipeline-fit
The error is:
pyspark raises this error higher in the chain:
`NotImplementedError("This Java ML type cannot be loaded into Python currently`
Any help is appreciated!