traviscrawford / spark-dynamodb

DynamoDB data source for Apache Spark
Apache License 2.0

Issues with pyspark 2.2 (throttling and filtering) #39

Open alfredox10 opened 6 years ago

alfredox10 commented 6 years ago

I'm opening this as a new issue since I keep hitting different errors while trying to use this plugin.

This is the command I'm currently running:

df = spark.read.format("com.github.traviscrawford.spark.dynamodb").option("region", "us-west-2").option("table", "solr-product").load()
df.take(1)
df.count()

I keep hitting throttling errors when reading the table. I raised the provisioned read capacity to 5,000 and the errors persist, so I'm not sure whether capacity is the problem or something else is.
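For reference, a minimal sketch of how the scan options might be assembled to spread the scan across segments and cap per-segment throughput. The option names `segments` and `rate_limit_per_segment` are taken from this repo's README and should be verified against the version in use; the `dynamodb_scan_options` helper itself is hypothetical:

```python
# Hypothetical helper: builds the option map for a rate-limited parallel scan.
# Option names ("segments", "rate_limit_per_segment") come from the README;
# verify them against the connector version you are running.
def dynamodb_scan_options(table, region, segments=None, rate_limit_per_segment=None):
    opts = {"table": table, "region": region}
    if segments is not None:
        opts["segments"] = str(segments)          # number of parallel scan segments
    if rate_limit_per_segment is not None:
        # cap on read capacity units consumed per second, per segment
        opts["rate_limit_per_segment"] = str(rate_limit_per_segment)
    return opts

opts = dynamodb_scan_options("solr-product", "us-west-2",
                             segments=8, rate_limit_per_segment=100)

# Applied to a reader (requires a live SparkSession, so left commented out):
# reader = spark.read.format("com.github.traviscrawford.spark.dynamodb")
# for k, v in opts.items():
#     reader = reader.option(k, v)
# df = reader.load()
```

With 8 segments at 100 RCU/s each, the scan would consume at most ~800 RCU/s in total, which should stay well under a 5,000 RCU table.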

It seems the command tries to load the entire table at once. I found that the data source also accepts a server-side filter expression, but even with one I still ran into problems. This is the filter expression I was using:

df = spark.read.format("com.github.traviscrawford.spark.dynamodb").option("region", "us-west-2").option("filter_expression", "product_id = 4038609").option("table", "solr-product").load()
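Worth noting: a DynamoDB FilterExpression is applied after items are read, so it trims the response but does not reduce the read capacity a scan consumes — a filtered scan still pays for every item it touches. A pure-Python sketch of that behavior (the helper and numbers are illustrative, not the connector's actual code):

```python
# Illustrative model of how DynamoDB applies a FilterExpression:
# the scan reads (and bills) every item in each page; the filter only
# decides what is returned to the caller afterward.
def scan_with_filter(items, predicate, page_size=100):
    consumed = 0
    results = []
    for i in range(0, len(items), page_size):
        page = items[i:i + page_size]
        consumed += len(page)  # capacity is consumed for the full page read
        results.extend(it for it in page if predicate(it))  # filter after the read
    return results, consumed

items = [{"product_id": n} for n in range(1000)]
matches, consumed = scan_with_filter(items, lambda it: it["product_id"] == 403)
# consumed is 1000 even though only one item matches
```

So a filter expression can reduce the data shipped back to Spark, but it will not by itself stop the scan from throttling a table.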

This is the error I got:

18/01/05 20:40:23 INFO YarnClientSchedulerBackend: Application application_1515175129419_0011 has started running.
18/01/05 20:40:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42939.
18/01/05 20:40:23 INFO NettyBlockTransferService: Server created on 172.16.88.191:42939
18/01/05 20:40:23 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/01/05 20:40:23 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.16.88.191, 42939, None)
18/01/05 20:40:23 INFO BlockManagerMasterEndpoint: Registering block manager 172.16.88.191:42939 with 413.9 MB RAM, BlockManagerId(driver, 172.16.88.191, 42939, None)
18/01/05 20:40:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.16.88.191, 42939, None)
18/01/05 20:40:23 INFO BlockManager: external shuffle service port = 7337
18/01/05 20:40:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.16.88.191, 42939, None)
18/01/05 20:40:24 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1515175129419_0011
18/01/05 20:40:24 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
18/01/05 20:40:24 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
18/01/05 20:40:24 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
18/01/05 20:40:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('hdfs:///user/spark/warehouse').
18/01/05 20:40:24 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
18/01/05 20:40:24 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
18/01/05 20:40:25 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
18/01/05 20:40:26 INFO CodeGenerator: Code generated in 165.389524 ms
18/01/05 20:40:26 INFO SparkContext: Starting job: json at DynamoDBRelation.scala:62
18/01/05 20:40:26 INFO DAGScheduler: Got job 0 (json at DynamoDBRelation.scala:62) with 2 output partitions
18/01/05 20:40:26 INFO DAGScheduler: Final stage: ResultStage 0 (json at DynamoDBRelation.scala:62)
18/01/05 20:40:26 INFO DAGScheduler: Parents of final stage: List()
18/01/05 20:40:26 INFO DAGScheduler: Missing parents: List()
18/01/05 20:40:26 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at json at DynamoDBRelation.scala:62), which has no missing parents
18/01/05 20:40:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.2 KB, free 413.9 MB)
18/01/05 20:40:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.8 KB, free 413.9 MB)
18/01/05 20:40:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.88.191:42939 (size: 4.8 KB, free: 413.9 MB)
18/01/05 20:40:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1047
18/01/05 20:40:26 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at json at DynamoDBRelation.scala:62) (first 15 tasks are for partitions Vector(0, 1))
18/01/05 20:40:26 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
18/01/05 20:40:27 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
18/01/05 20:40:30 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.16.88.135:57466) with ID 1
18/01/05 20:40:30 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
18/01/05 20:40:30 WARN TaskSetManager: Stage 0 contains a task of very large size (623 KB). The maximum recommended task size is 100 KB.
18/01/05 20:40:30 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-16-88-135.us-west-2.compute.internal, executor 1, partition 0, PROCESS_LOCAL, 637966 bytes)
18/01/05 20:40:30 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-16-88-135.us-west-2.compute.internal, executor 1, partition 1, PROCESS_LOCAL, 633356 bytes)
18/01/05 20:40:30 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-16-88-135.us-west-2.compute.internal:42633 with 2.8 GB RAM, BlockManagerId(1, ip-172-16-88-135.us-west-2.compute.internal, 42633, None)
18/01/05 20:40:30 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-16-88-135.us-west-2.compute.internal:42633 (size: 4.8 KB, free: 2.8 GB)
18/01/05 20:40:31 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1013 ms on ip-172-16-88-135.us-west-2.compute.internal (executor 1) (1/2)
18/01/05 20:40:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1036 ms on ip-172-16-88-135.us-west-2.compute.internal (executor 1) (2/2)
18/01/05 20:40:31 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/01/05 20:40:31 INFO DAGScheduler: ResultStage 0 (json at DynamoDBRelation.scala:62) finished in 4.214 s
18/01/05 20:40:31 INFO DAGScheduler: Job 0 finished: json at DynamoDBRelation.scala:62, took 4.346569 s
18/01/05 20:40:31 INFO DynamoDBRelation: Table solr-product contains 513643 items using 1147952212 bytes.
18/01/05 20:40:31 INFO DynamoDBRelation: Schema for tableName solr-product: StructType(StructField(ColorSwatches,StringType,true), StructField(IsBeauty,StringType,true), StructField(LTSDateEnd,StringType,true), StructField(LTSDateStart,StringType,true), StructField(LTSFlag,BooleanType,true), StructField(LTSPercentOff,StringType,true), StructField(LTSPrice,StringType,true), StructField(age_code,StringType,true), StructField(age_group,StringType,true), StructField(alt_image_url,StringType,true), StructField(alternate_view_count,LongType,true), StructField(archived,BooleanType,true), StructField(associated_style_id,StringType,true), StructField(associated_style_numbers,StringType,true), StructField(available_color_count,LongType,true), StructField(average_review,DoubleType,true), StructField(brand_display_name,StringType,true), StructField(brand_id,LongType,true), StructField(brand_name,StringType,true), StructField(buy_and_save,StringType,true), StructField(classifier,StringType,true), StructField(classifier_id,StringType,true), StructField(composite_classifier,StringType,true), StructField(custom_holiday_flag,BooleanType,true), StructField(date_created,StringType,true), StructField(date_go_live,StringType,true), StructField(date_image_modified,StringType,true), StructField(date_published,StringType,true), StructField(display_photo_id,StringType,true), StructField(doc_description,StringType,true), StructField(doc_name,StringType,true), StructField(doc_type,StringType,true), StructField(execution_id,LongType,true), StructField(fit_info,StringType,true), StructField(fit_recs_available,BooleanType,true), StructField(fit_type_description,StringType,true), StructField(fulfillment_available_percentage,LongType,true), StructField(gender,StringType,true), StructField(gender_code,StringType,true), StructField(gwp,StringType,true), StructField(image_url,StringType,true), StructField(internal_anniversary_flag,BooleanType,true), StructField(inv_conf,LongType,true), 
StructField(is_umap_enabled,BooleanType,true), StructField(keyword,StringType,true), StructField(last_modified,StringType,true), StructField(live_status,BooleanType,true), StructField(max_msrp,StringType,true), StructField(max_percent_off,LongType,true), StructField(max_price,DoubleType,true), StructField(med_video_url,StringType,true), StructField(min_msrp,StringType,true), StructField(min_percent_off,LongType,true), StructField(min_price,DoubleType,true), StructField(path_alias,StringType,true), StructField(photos,StringType,true), StructField(product_id,LongType,true), StructField(product_uri,StringType,true), StructField(ready_status,BooleanType,true), StructField(review_count,LongType,true), StructField(sale_max_price,DoubleType,true), StructField(sale_min_price,DoubleType,true), StructField(size_chart_name,StringType,true), StructField(size_info,StringType,true), StructField(special_copy,StringType,true), StructField(style_features,StringType,true), StructField(style_num,StringType,true), StructField(style_num_alias,StringType,true), StructField(subclassifier,StringType,true), StructField(subclassifier_id,LongType,true), StructField(title,StringType,true), StructField(u_map_start_date,StringType,true), StructField(umap_end_date,StringType,true))
18/01/05 20:40:31 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/01/05 20:40:31 INFO ContextCleaner: Cleaned accumulator 52
18/01/05 20:40:31 INFO CodeGenerator: Code generated in 28.685884 ms
18/01/05 20:40:31 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 172.16.88.191:42939 in memory (size: 4.8 KB, free: 413.9 MB)
18/01/05 20:40:31 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-172-16-88-135.us-west-2.compute.internal:42633 in memory (size: 4.8 KB, free: 2.8 GB)
18/01/05 20:40:31 INFO CodeGenerator: Code generated in 12.303414 ms
18/01/05 20:40:31 INFO SparkContext: Starting job: count at NativeMethodAccessorImpl.java:0
18/01/05 20:40:31 INFO DAGScheduler: Registering RDD 12 (count at NativeMethodAccessorImpl.java:0)
18/01/05 20:40:31 INFO DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:0) with 1 output partitions
18/01/05 20:40:31 INFO DAGScheduler: Final stage: ResultStage 2 (count at NativeMethodAccessorImpl.java:0)
18/01/05 20:40:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
18/01/05 20:40:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
18/01/05 20:40:31 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[12] at count at NativeMethodAccessorImpl.java:0), which has no missing parents
18/01/05 20:40:31 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 12.3 KB, free 413.9 MB)
18/01/05 20:40:31 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.2 KB, free 413.9 MB)
18/01/05 20:40:31 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.16.88.191:42939 (size: 6.2 KB, free: 413.9 MB)
18/01/05 20:40:31 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1047
18/01/05 20:40:31 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[12] at count at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
18/01/05 20:40:31 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
18/01/05 20:40:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, ip-172-16-88-135.us-west-2.compute.internal, executor 1, partition 0, PROCESS_LOCAL, 8442 bytes)
18/01/05 20:40:31 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-16-88-135.us-west-2.compute.internal:42633 (size: 6.2 KB, free: 2.8 GB)
18/01/05 20:42:02 WARN Errors: The following warnings have been detected: WARNING: The (sub)resource method stageData in org.apache.spark.status.api.v1.OneStageResource contains empty path annotation.

18/01/05 20:42:03 WARN ServletHandler: 
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
    at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
    at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
    at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.spark_project.jetty.server.Server.handle(Server.java:524)
    at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
    at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
    at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at org.apache.spark.status.api.v1.MetricHelper.submetricQuantiles(AllStagesResource.scala:313)
    at org.apache.spark.status.api.v1.AllStagesResource$$anon$1.build(AllStagesResource.scala:178)
    at org.apache.spark.status.api.v1.AllStagesResource$.taskMetricDistributions(AllStagesResource.scala:181)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:71)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:62)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:130)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.withStage(OneStageResource.scala:97)
    at org.apache.spark.status.api.v1.OneStageResource.withStageAttempt(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    ... 28 more
18/01/05 20:42:03 WARN HttpChannel: //ip-172-16-88-191.us-west-2.compute.internal:4040/api/v1/applications/application_1515175129419_0011/stages/2/0/taskSummary?proxyapproved=true
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
    at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
    at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
    at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.spark_project.jetty.server.Server.handle(Server.java:524)
    at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
    at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
    at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at org.apache.spark.status.api.v1.MetricHelper.submetricQuantiles(AllStagesResource.scala:313)
    at org.apache.spark.status.api.v1.AllStagesResource$$anon$1.build(AllStagesResource.scala:178)
    at org.apache.spark.status.api.v1.AllStagesResource$.taskMetricDistributions(AllStagesResource.scala:181)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:71)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:62)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:130)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.withStage(OneStageResource.scala:97)
    at org.apache.spark.status.api.v1.OneStageResource.withStageAttempt(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    ... 28 more
18/01/05 20:43:28 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 177440 ms on ip-172-16-88-135.us-west-2.compute.internal (executor 1) (1/1)
18/01/05 20:43:28 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/01/05 20:43:28 INFO DAGScheduler: ShuffleMapStage 1 (count at NativeMethodAccessorImpl.java:0) finished in 177.441 s
18/01/05 20:43:28 INFO DAGScheduler: looking for newly runnable stages
18/01/05 20:43:28 INFO DAGScheduler: running: Set()
18/01/05 20:43:28 INFO DAGScheduler: waiting: Set(ResultStage 2)
18/01/05 20:43:28 INFO DAGScheduler: failed: Set()
18/01/05 20:43:28 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[15] at count at NativeMethodAccessorImpl.java:0), which has no missing parents
18/01/05 20:43:28 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 7.0 KB, free 413.9 MB)
18/01/05 20:43:28 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.7 KB, free 413.9 MB)
18/01/05 20:43:28 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.16.88.191:42939 (size: 3.7 KB, free: 413.9 MB)
18/01/05 20:43:28 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1047
18/01/05 20:43:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at count at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
18/01/05 20:43:28 INFO YarnScheduler: Adding task set 2.0 with 1 tasks
18/01/05 20:43:28 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, ip-172-16-88-135.us-west-2.compute.internal, executor 1, partition 0, NODE_LOCAL, 4737 bytes)
18/01/05 20:43:28 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-16-88-135.us-west-2.compute.internal:42633 (size: 3.7 KB, free: 2.8 GB)
18/01/05 20:43:28 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 172.16.88.135:57466
18/01/05 20:43:28 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 173 bytes
18/01/05 20:43:28 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 68 ms on ip-172-16-88-135.us-west-2.compute.internal (executor 1) (1/1)
18/01/05 20:43:28 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool 
18/01/05 20:43:28 INFO DAGScheduler: ResultStage 2 (count at NativeMethodAccessorImpl.java:0) finished in 0.069 s
18/01/05 20:43:28 INFO DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:0, took 177.543464 s
0
18/01/05 20:43:29 INFO SparkContext: Invoking stop() from shutdown hook
18/01/05 20:43:29 INFO SparkUI: Stopped Spark web UI at http://ip-172-16-88-191.us-west-2.compute.internal:4040
18/01/05 20:43:29 INFO YarnClientSchedulerBackend: Interrupting monitor thread
18/01/05 20:43:29 INFO YarnClientSchedulerBackend: Shutting down all executors
18/01/05 20:43:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/01/05 20:43:29 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
18/01/05 20:43:29 INFO YarnClientSchedulerBackend: Stopped
18/01/05 20:43:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/05 20:43:29 INFO MemoryStore: MemoryStore cleared
18/01/05 20:43:29 INFO BlockManager: BlockManager stopped
18/01/05 20:43:29 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/05 20:43:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/05 20:43:29 INFO SparkContext: Successfully stopped SparkContext
18/01/05 20:43:29 INFO ShutdownHookManager: Shutdown hook called
18/01/05 20:43:29 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7eaf86b4-d141-400e-8c40-e37045981e1c
18/01/05 20:43:29 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7eaf86b4-d141-400e-8c40-e37045981e1c/pyspark-8b546929-ebdc-4b12-a81e-3845e181dd58
maccopper2 commented 6 years ago

I get the same error message, though I don't use pyspark or DynamoDB. I'm using Spark with Scala on AWS EMR 5.12.0, reading and writing via s3a.

This bug has been reported on Stack Overflow (twice) and in the AWS Developer Forums.

There is no known solution.

Regards, Jan

PS: My Stacktrace was:

18/03/16 15:20:40 WARN ProcessXML: Writing to path 's3a://<bucket & Path removed. Bucket is in Frankfurt>' with header false t:main
18/03/16 15:20:53 WARN ServletHandler:  t:SparkUI-44
javax.servlet.ServletException: java.util.NoSuchElementException: None.get
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
    at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
    at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
    at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
    at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.spark_project.jetty.server.Server.handle(Server.java:524)
    at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
    at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
    at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at org.apache.spark.status.api.v1.MetricHelper.submetricQuantiles(AllStagesResource.scala:313)
    at org.apache.spark.status.api.v1.AllStagesResource$$anon$1.build(AllStagesResource.scala:178)
    at org.apache.spark.status.api.v1.AllStagesResource$.taskMetricDistributions(AllStagesResource.scala:181)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:71)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$taskSummary$1.apply(OneStageResource.scala:62)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:130)
    at org.apache.spark.status.api.v1.OneStageResource$$anonfun$withStageAttempt$1.apply(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.withStage(OneStageResource.scala:97)
    at org.apache.spark.status.api.v1.OneStageResource.withStageAttempt(OneStageResource.scala:126)
    at org.apache.spark.status.api.v1.OneStageResource.taskSummary(OneStageResource.scala:62)
    at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    ... 28 more
18/03/16 15:20:53 WARN HttpChannel: //ip-172-31-5-253.eu-central-1.compute.internal:4040/api/v1/applications/application_1521213166704_0001/stages/68/0/taskSummary?proxyapproved=true t:SparkUI-44
javax.servlet.ServletException: java.util.NoSuchElementException: None.get