microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

mmlspark 0.17 : java.lang.NullPointerException: Value at index 4 is null #679

Open jaipal1995 opened 5 years ago

jaipal1995 commented 5 years ago

I have been facing the NullPointerException mentioned in the title, followed by a 'Connection refused' error. It seems like connections to the executors are being closed before all the tasks have finished. Can you please look into this?

I am running in cluster mode via spark-submit with dynamic allocation enabled. Stack trace:

2019-08-30 22:07:55 INFO  FairSchedulableBuilder:54 - Added task set TaskSet_183.0 tasks to pool default
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 0.0 in stage 183.0 (TID 6374, JMNGDBAE230C07, executor 9, partition 0, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 1.0 in stage 183.0 (TID 6375, JMNGD1BAE210V10, executor 5, partition 1, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 2.0 in stage 183.0 (TID 6376, JMNGD1BAF070C08, executor 1, partition 2, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 3.0 in stage 183.0 (TID 6377, JMNGD1BAE050V03, executor 4, partition 3, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 4.0 in stage 183.0 (TID 6378, JMNGD1BAE050V03, executor 16, partition 4, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 5.0 in stage 183.0 (TID 6379, JMNGD1BAE210V10, executor 11, partition 5, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 6.0 in stage 183.0 (TID 6380, JMNGD1BAE170V10, executor 7, partition 6, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 7.0 in stage 183.0 (TID 6381, JMNGD1BAE050V03, executor 10, partition 7, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 8.0 in stage 183.0 (TID 6382, JMNGD1BAE050V04, executor 6, partition 8, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 9.0 in stage 183.0 (TID 6383, JMNGD1BAG070C09, executor 20, partition 9, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 10.0 in stage 183.0 (TID 6384, JMNGD1BAE050V04, executor 18, partition 10, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 11.0 in stage 183.0 (TID 6385, JMNGD1BAE170V10, executor 19, partition 11, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 12.0 in stage 183.0 (TID 6386, JMNGD1BAE170V10, executor 13, partition 12, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 13.0 in stage 183.0 (TID 6387, JMNGD1BAE050V04, executor 12, partition 13, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 14.0 in stage 183.0 (TID 6388, JMNGD1BAG70C07, executor 2, partition 14, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 15.0 in stage 183.0 (TID 6389, JMNGDBAE230C07, executor 15, partition 15, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 16.0 in stage 183.0 (TID 6390, JMNGD1BAE050V03, executor 3, partition 16, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 17.0 in stage 183.0 (TID 6391, JMNGD1BAG070C09, executor 14, partition 17, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 18.0 in stage 183.0 (TID 6392, JMNGD1BAE210V10, executor 17, partition 18, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Starting task 19.0 in stage 183.0 (TID 6393, JMNGD1BAG070C08, executor 8, partition 19, PROCESS_LOCAL, 7730 bytes)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAG70C07:39516 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAF070C08:35074 (size: 4.6 KB, free: 1684.9 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE210V10:35229 (size: 4.6 KB, free: 1677.6 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V04:34487 (size: 4.6 KB, free: 1685.0 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V04:33947 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE170V10:38038 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V04:32928 (size: 4.6 KB, free: 1685.0 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE210V10:35889 (size: 4.6 KB, free: 1683.4 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE170V10:35078 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGDBAE230C07:45082 (size: 4.6 KB, free: 1685.0 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE170V10:41743 (size: 4.6 KB, free: 1685.0 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V03:45574 (size: 4.6 KB, free: 1592.4 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V03:43273 (size: 4.6 KB, free: 1607.9 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE210V10:43736 (size: 4.6 KB, free: 1683.0 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGDBAE230C07:44691 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V03:38066 (size: 4.6 KB, free: 1589.7 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAE050V03:34388 (size: 4.6 KB, free: 1585.7 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAG070C08:42414 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAG070C09:37013 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Finished task 14.0 in stage 183.0 (TID 6388) in 41 ms on JMNGD1BAG70C07 (executor 2) (1/20)
2019-08-30 22:07:55 INFO  BlockManagerInfo:54 - Added broadcast_123_piece0 in memory on JMNGD1BAG070C09:34268 (size: 4.6 KB, free: 1685.1 MB)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Finished task 2.0 in stage 183.0 (TID 6376) in 67 ms on JMNGD1BAF070C08 (executor 1) (2/20)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Finished task 1.0 in stage 183.0 (TID 6375) in 69 ms on JMNGD1BAE210V10 (executor 5) (3/20)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Finished task 10.0 in stage 183.0 (TID 6384) in 52 ms on JMNGD1BAE050V04 (executor 18) (4/20)
2019-08-30 22:07:55 INFO  TaskSetManager:54 - Finished task 13.0 in stage 183.0 (TID 6387) in 45 ms on JMNGD1BAE050V04 (executor 12) (5/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 12.0 in stage 183.0 (TID 6386) in 48 ms on JMNGD1BAE170V10 (executor 13) (6/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 8.0 in stage 183.0 (TID 6382) in 58 ms on JMNGD1BAE050V04 (executor 6) (7/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 18.0 in stage 183.0 (TID 6392) in 36 ms on JMNGD1BAE210V10 (executor 17) (8/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 6.0 in stage 183.0 (TID 6380) in 62 ms on JMNGD1BAE170V10 (executor 7) (9/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 11.0 in stage 183.0 (TID 6385) in 52 ms on JMNGD1BAE170V10 (executor 19) (10/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 3.0 in stage 183.0 (TID 6377) in 69 ms on JMNGD1BAE050V03 (executor 4) (11/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 15.0 in stage 183.0 (TID 6389) in 52 ms on JMNGDBAE230C07 (executor 15) (12/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 0.0 in stage 183.0 (TID 6374) in 84 ms on JMNGDBAE230C07 (executor 9) (13/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 16.0 in stage 183.0 (TID 6390) in 51 ms on JMNGD1BAE050V03 (executor 3) (14/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 4.0 in stage 183.0 (TID 6378) in 77 ms on JMNGD1BAE050V03 (executor 16) (15/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 9.0 in stage 183.0 (TID 6383) in 67 ms on JMNGD1BAG070C09 (executor 20) (16/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 7.0 in stage 183.0 (TID 6381) in 71 ms on JMNGD1BAE050V03 (executor 10) (17/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 19.0 in stage 183.0 (TID 6393) in 45 ms on JMNGD1BAG070C08 (executor 8) (18/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 5.0 in stage 183.0 (TID 6379) in 76 ms on JMNGD1BAE210V10 (executor 11) (19/20)
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Finished task 17.0 in stage 183.0 (TID 6391) in 52 ms on JMNGD1BAG070C09 (executor 14) (20/20)
2019-08-30 22:07:56 INFO  YarnClusterScheduler:54 - Removed TaskSet 183.0, whose tasks have all completed, from pool default
2019-08-30 22:07:56 INFO  DAGScheduler:54 - ResultStage 183 (collect at LightGBMUtils.scala:183) finished in 0.095 s
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Job 75 finished: collect at LightGBMUtils.scala:183, took 0.097195 s
2019-08-30 22:07:56 INFO  LightGBMRegressor:110 - driver expecting 1 connections...
2019-08-30 22:07:56 INFO  LightGBMRegressor:112 - driver accepting a new connection...
2019-08-30 22:07:56 INFO  LightGBMRegressor:140 - driver waiting for connections on host: 10.144.96.175 and port: 35258
2019-08-30 22:07:56 INFO  LightGBMRegressor:42 - LightGBM parameters: alpha=0.9 tweedie_variance_power=1.5 is_pre_partition=True boosting_type=gbdt tree_learner=data_parallel num_iterations=500 learning_rate=0.1 num_leaves=5 max_bin=255 bagging_fraction=0.8999999761581421 bagging_freq=0 bagging_seed=10011993 early_stopping_round=50 feature_fraction=0.8 max_depth=4 min_sum_hessian_in_leaf=10.0 num_machines=1 objective=regression verbosity=3 boost_from_average=true lambda_l1=0.0 lambda_l2=0.0 
2019-08-30 22:07:56 INFO  SparkContext:54 - Starting job: reduce at LightGBMBase.scala:51
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Got job 76 (reduce at LightGBMBase.scala:51) with 1 output partitions
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Final stage: ResultStage 187 (reduce at LightGBMBase.scala:51)
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 186)
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Missing parents: List()
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Submitting ResultStage 187 (MapPartitionsRDD[709] at reduce at LightGBMBase.scala:51), which has no missing parents
2019-08-30 22:07:56 INFO  MemoryStore:54 - Block broadcast_124 stored as values in memory (estimated size 26.1 KB, free 3.5 GB)
2019-08-30 22:07:56 INFO  MemoryStore:54 - Block broadcast_124_piece0 stored as bytes in memory (estimated size 12.2 KB, free 3.5 GB)
2019-08-30 22:07:56 INFO  BlockManagerInfo:54 - Added broadcast_124_piece0 in memory on JMNGD1BAE210V10:36083 (size: 12.2 KB, free: 3.6 GB)
2019-08-30 22:07:56 INFO  SparkContext:54 - Created broadcast 124 from broadcast at DAGScheduler.scala:1039
2019-08-30 22:07:56 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 187 (MapPartitionsRDD[709] at reduce at LightGBMBase.scala:51) (first 15 tasks are for partitions Vector(0))
2019-08-30 22:07:56 INFO  YarnClusterScheduler:54 - Adding task set 187.0 with 1 tasks
2019-08-30 22:07:56 INFO  FairSchedulableBuilder:54 - Added task set TaskSet_187.0 tasks to pool default
2019-08-30 22:07:56 INFO  TaskSetManager:54 - Starting task 0.0 in stage 187.0 (TID 6394, JMNGD1BAE210V10, executor 11, partition 0, NODE_LOCAL, 8128 bytes)
2019-08-30 22:07:56 INFO  BlockManagerInfo:54 - Added broadcast_124_piece0 in memory on JMNGD1BAE210V10:43736 (size: 12.2 KB, free: 1683.0 MB)
2019-08-30 22:07:56 INFO  LightGBMRegressor:120 - driver received socket from worker: 10.144.96.175:12488
2019-08-30 22:07:56 INFO  LightGBMRegressor:127 - driver writing back to all connections: 10.144.96.175:12488
2019-08-30 22:07:56 INFO  LightGBMRegressor:134 - driver closing all sockets and server socket
2019-08-30 22:07:56 WARN  TaskSetManager:66 - Lost task 0.0 in stage 187.0 (TID 6394, JMNGD1BAE210V10, executor 11): java.lang.NullPointerException: Value at index 4 is null
    at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
    at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:166)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$generateDataset$1$$anonfun$6.apply(TrainUtils.scala:64)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$generateDataset$1$$anonfun$6.apply(TrainUtils.scala:64)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.microsoft.ml.spark.TrainUtils$.generateDataset(TrainUtils.scala:63)
    at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:168)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:316)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

jaipal1995 commented 5 years ago

Hi, after going through the code, I realized that the error occurs when 'weightCol' contains null values. It would have been better if this case were handled more gracefully. I have worked around it on my side. I am still wondering, though, why I got the 'Connection refused' error after that; it would be great if anyone could explain.
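
For reference, one way to handle the null weights before fitting — a minimal sketch (the helper names are made up; `logWeights` is the weight column used later in this thread):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Drop rows whose weight is null before fitting, so LightGBM's dataset
// generation (Row.getDouble on the weight column) never sees a null.
def dropNullWeights(df: DataFrame, weightCol: String = "logWeights"): DataFrame =
  df.filter(col(weightCol).isNotNull)

// Alternative: treat a missing weight as a neutral weight of 1.0 instead.
def fillNullWeights(df: DataFrame, weightCol: String = "logWeights"): DataFrame =
  df.na.fill(1.0, Seq(weightCol))
```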

imatiach-msft commented 5 years ago

hi @jaipal1995, the connection refused error just means some of the workers timed out; it is usually hiding the real error and is a red herring. This usually happens when some workers fail or exit early: the other workers may then get stuck on the next iteration and eventually time out. It's on my todo list to try to improve this. After handling the null values, are you still seeing the connection refused error?

jaipal1995 commented 5 years ago

Hi @imatiach-msft, no, I am not getting any error after handling the null values. Thanks.

jaipal1995 commented 5 years ago

Hi @imatiach-msft, in the code I see that the training dataframe is repartitioned to the number of executor cores available to the Spark application so that it runs efficiently (assuming only one model is being trained at a time). In my scenario, several models are trained at the same time using a parArray. Do you have any suggestions for optimizing this scenario?
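
For context, a minimal sketch of the parallel-training pattern being described (`brickIds` and `trainForBrick` are hypothetical stand-ins; capping the task support is one common way to limit how many fits run at once):

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val brickIds = Vector("brick-001", "brick-002", "brick-003")

// Stand-in for fitting one LightGBM pipeline on a single brick's rows.
def trainForBrick(id: String): Unit =
  println(s"training model for brick $id")

val parBricks = brickIds.par
// Cap how many fits run concurrently so they do not all compete for
// the same executors at once.
parBricks.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
parBricks.foreach(trainForBrick)
```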

jaipal1995 commented 5 years ago

Hi @imatiach-msft, I am getting the following error again after fixing the previous issue. There was also one more error: 'java.lang.UnsatisfiedLinkError: lib_lightgbm.so: failed to map segment from shared object'. I have tried a few options. Please suggest a solution as soon as possible. Thanks.

User class threw exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:164)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:144)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:144)
    at ai.couture.obelisk.retail.usecases.plpAnalysis.SalesPrediction$$anonfun$doTransformations$1$$anonfun$apply$11$$anonfun$apply$12.apply(SalesPrediction.scala:354)
    at ai.couture.obelisk.retail.usecases.plpAnalysis.SalesPrediction$$anonfun$doTransformations$1$$anonfun$apply$11$$anonfun$apply$12.apply(SalesPrediction.scala:340)
    at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143)
    at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136)
    at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
    at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 149 in stage 1802.0 failed 4 times, most recent failure: Lost task 149.3 in stage 1802.0 (TID 183416, JMNGD1BAG70C07, executor 14): java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at java.net.Socket.<init>(Socket.java:434)
    at java.net.Socket.<init>(Socket.java:211)
    at com.microsoft.ml.spark.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:240)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:303)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298)
    at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
    at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1610)
    at com.microsoft.ml.spark.LightGBMBase$class.train(LightGBMBase.scala:51)
    at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35)
    at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
    at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:96)
    at org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(CrossValidator.scala:152)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.Future$.apply(Future.scala:31)
    at scala.concurrent.Future$.apply(Future.scala:494)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:160)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:150)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:150)
    ... 27 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at java.net.Socket.<init>(Socket.java:434)
    at java.net.Socket.<init>(Socket.java:211)
    at com.microsoft.ml.spark.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:240)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:303)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298)
    at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

jaipal1995 commented 5 years ago

2019-09-04 17:25:44 INFO  BlockManager:54 - Found block rdd_13321_52 remotely
2019-09-04 17:25:44 INFO  BlockManager:54 - Found block rdd_13321_196 remotely
2019-09-04 17:25:44 INFO  LightGBMRegressor:233 - Successfully bound to port 12472
2019-09-04 17:25:44 INFO  LightGBMRegressor:302 - LightGBM worker connecting to host: 10.144.96.175 and port: 38036
2019-09-04 17:25:44 INFO  LightGBMRegressor:249 - send empty status to driver
2019-09-04 17:25:44 WARN  LightGBMRegressor:307 - LightGBM worker encountered empty partition, for best performance ensure no partitions empty
2019-09-04 17:25:44 INFO  Executor:54 - Finished task 58.1 in stage 2542.0 (TID 235941). 8226 bytes result sent to driver
2019-09-04 17:25:44 INFO  CoarseGrainedExecutorBackend:54 - Got assigned task 235944
2019-09-04 17:25:44 INFO  Executor:54 - Running task 78.0 in stage 2546.0 (TID 235944)
2019-09-04 17:25:44 INFO  BlockManager:54 - Found block rdd_13241_72 locally
2019-09-04 17:25:44 INFO  LightGBMRegressor:233 - Successfully bound to port 12472
2019-09-04 17:25:45 ERROR Executor:91 - Exception in task 78.0 in stage 2546.0 (TID 235944)
java.lang.UnsatisfiedLinkError: /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/tmp/mml-natives8004835678613750376/lib_lightgbm_swig.so: /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/tmp/mml-natives8004835678613750376/lib_lightgbm_swig.so: failed to map segment from shared object
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59)
    at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:39)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298)
    at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/hs_err_pid14837.log

jaipal1995 commented 5 years ago

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2208.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2208.0 (TID 217002, JMNGD1BAE170V10, executor 6): java.lang.UnsatisfiedLinkError: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59)
    at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:38)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298)
    at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
    at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1610)
    at com.microsoft.ml.spark.LightGBMBase$class.train(LightGBMBase.scala:51)
    at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35)
    at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
    at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:96)
    at org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(CrossValidator.scala:152)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
    at scala.concurrent.impl.Future$.apply(Future.scala:31)
    at scala.concurrent.Future$.apply(Future.scala:494)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:160)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:150)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:150)
    ... 27 more
Caused by: java.lang.UnsatisfiedLinkError: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59)
    at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:38)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301)
    at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298)
    at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

imatiach-msft commented 5 years ago

@jaipal1995 "lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory at java.lang.ClassLoader$NativeLibrary.load"

it looks like an out-of-memory error. How large is your dataset (number of rows and columns)? How many workers do you have in your cluster, and what are their specifications? What are the driver's specifications (as opposed to the workers')?

"In my scenario, there are several models trained at a time using parArray. Do you have any suggestions on this scenario for optimization?" I've sometimes seen users have issues when training more than one model in parallel. I probably need to investigate more to understand what might be happening there.

jaipal1995 commented 5 years ago

Hi @imatiach-msft, I have a dataframe with around 300 distinct values of a column called brickid. I have to train a model on each brickid's data (currently training for each brickid sequentially), and the training data per brickid varies from roughly 100 to 50,000 rows with around 42 columns. Cluster configuration: the maximum number of executors for an application is 24, each executor with 4 GB RAM and 1 core, and the driver has 6 GB RAM and 1 core. I am using a grid search with a total of 48 combinations and a CrossValidator with 3 folds, so 48 × 3 different models have to be trained. This is what my LightGBMRegressor setup looks like:

```scala
val lightGBMRegressor: LightGBMRegressor = new LightGBMRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setObjective("regression")
  .setMinSumHessianInLeaf(10)
  .setVerbosity(3)
  .setBaggingFraction(0.9f)
  .setFeatureFraction(0.8)
  .setEarlyStoppingRound(50)
  .setBaggingSeed(10011993)
  .setNumLeaves(10)
  .setPredictionCol("prediction")
  .setBoostingType("gbdt")
  .setWeightCol("logWeights")

val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(lightGBMRegressor.maxDepth, Array(4, 6, 8))
  .addGrid(lightGBMRegressor.learningRate, Array(0.05, 0.1))
  .addGrid(lightGBMRegressor.numLeaves, Array(5, 10, 20, 40))
  .addGrid(lightGBMRegressor.numIterations, Array(200, 500))
  .build()

val pipeline: Pipeline = new Pipeline().setStages(Array(lightGBMRegressor))

val cv: CrossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(24)
```

Kindly correct me if I am doing anything wrong at any point.
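
A note on the numbers above: the grid is 3 maxDepth × 2 learningRate × 4 numLeaves × 2 numIterations = 48 parameter maps, and with 3 folds that is 144 fits in total. `setParallelism(24)` lets the driver launch up to 24 of those fits at once; if memory pressure persists, a smaller value is one hedge worth trying — a sketch against the same `cv` object as above:

```scala
// setParallelism mutates and returns the CrossValidator itself; fewer
// concurrent fits trade wall-clock time for stability on 1-core,
// 4 GB executors like the ones described above.
cv.setParallelism(4)
```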

jaipal1995 commented 5 years ago

The reduce jobs are taking a lot of time, and only one executor is holding the data. I have attached screenshots in case they are of any help.

[Screenshot 2019-09-13 at 2.44.49 PM] [Screenshot 2019-09-13 at 2.42.16 PM]
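
For what it's worth, a minimal sketch of spreading the rows evenly across executors before fitting (`trainDf` is a hypothetical stand-in for the training dataframe; 24 matches the executor count mentioned earlier):

```scala
// An explicit repartition shuffles the rows into evenly sized
// partitions so that no single executor holds all of the data.
val spreadDf = trainDf.repartition(24)
// Caching helps when CrossValidator re-reads the same folds repeatedly.
spreadDf.cache()
```
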
jaipal1995 commented 5 years ago

Hi @imatiach-msft, any updates on this?

imatiach-msft commented 5 years ago

@jaipal1995 sorry about the trouble you are having. Agreed, it looks like there is a bug in doing hyperparameter tuning (multiple jobs in parallel) with lightgbm. I think I might be able to reproduce the issue myself, but if your dataset is public and you could send it to me along with the code that reproduces the exact issue, it would make it easier for me to diagnose the problem. If you are on a tight deadline, though, I wouldn't recommend waiting on me; I would try either creating the lightgbm models sequentially or using GBTClassifier from SparkML. At the same time, a repro would help me fix the issue, or at least identify the exact problem.
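
A minimal sketch of the sequential fallback suggested above (`brickIds`, `trainDf`, and `cv` are hypothetical stand-ins consistent with the earlier snippets, and `brickid` is an assumed column name):

```scala
import org.apache.spark.sql.functions.col

// Fit one cross-validated model per brick id, one at a time, instead of
// launching all of them through a parallel collection.
val modelsByBrick = brickIds.map { id =>
  id -> cv.fit(trainDf.filter(col("brickid") === id))
}.toMap
```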

It does seem that all of the data is being put on one executor, and that may explain the out-of-memory errors you are getting. I'm not sure about the cause, however, and I can't quite tell what the issue might be from the logs above.

jaipal1995 commented 5 years ago

@imatiach-msft, where do I share the data and code? If you have an email id, kindly share it.

imatiach-msft commented 5 years ago

hi @jaipal1995, you can send the data and code to: ilmat at microsoft dot com. Otherwise, you can also just share it on GitHub (e.g., a link to the dataset stored on GitHub or some server location), as long as it is not private.