jaipal1995 opened this issue 5 years ago
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
Hi, after going through the code, I realized the error occurs if 'weightCol' contains null values. It would be better if this case were handled more gracefully. I found a workaround for it, but I'm still wondering why I got a 'Connection refused' error after that. It would be great if anyone could explain.
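For reference, the kind of null handling described above looks roughly like this (a minimal sketch; the weight column name "logWeights" is an assumption for illustration):

```scala
// Drop rows whose weight column is null before fitting; the column
// name is illustrative.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def dropNullWeights(df: DataFrame, weightCol: String = "logWeights"): DataFrame =
  df.filter(col(weightCol).isNotNull)
```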
hi @jaipal1995, the connection refused error just means some of the workers timed out; it is usually hiding the real error and is a red herring. This typically happens when some workers fail or exit early, and the remaining workers then get stuck on the next iteration and eventually time out. Improving this is on my todo list. After handling the null values, are you still seeing the connection refused error?
Hi @imatiach-msft, no, I am not getting any errors after handling the null values. Thanks.
Hi @imatiach-msft, in the code I see that the training dataframe is repartitioned to the number of executor cores available to the Spark application, to make a single training run efficiently (assuming only one model is being trained at a time). In my scenario, several models are trained at once using a ParArray, as in the sketch below. Do you have any suggestions for optimizing this scenario?
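A rough sketch of that setup (one CrossValidator fit per group id, run concurrently via a parallel collection; the "brickId" column and the helper name are illustrative):

```scala
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def trainPerGroup(df: DataFrame, cv: CrossValidator,
                  groupIds: Seq[String]): Seq[(String, CrossValidatorModel)] =
  groupIds.par.map { id =>
    // each fit launches its own Spark jobs, so several run concurrently
    id -> cv.fit(df.filter(col("brickId") === id))
  }.seq
```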
Hi @imatiach-msft ,
I am getting the following error again after fixing the previous issue. There was also one more error: 'java.lang.UnsatisfiedLinkError: lib_lightgbm.so: failed to map segment from shared object'. I have tried a few options; please suggest a solution as soon as possible. Thanks.
```
User class threw exception: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:164) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:144) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:144) at ai.couture.obelisk.retail.usecases.plpAnalysis.SalesPrediction$$anonfun$doTransformations$1$$anonfun$apply$11$$anonfun$apply$12.apply(SalesPrediction.scala:354) at ai.couture.obelisk.retail.usecases.plpAnalysis.SalesPrediction$$anonfun$doTransformations$1$$anonfun$apply$11$$anonfun$apply$12.apply(SalesPrediction.scala:340) at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143) at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136) at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51) at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159) at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149) at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443) at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 149 in stage 1802.0 failed 4 times, most recent failure: Lost task 149.3 in stage 1802.0 (TID 183416, JMNGD1BAG70C07, executor 14): 
java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at java.net.Socket.connect(Socket.java:538) at java.net.Socket.<init>(Socket.java:434) at java.net.Socket.<init>(Socket.java:211) at com.microsoft.ml.spark.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:240) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:303) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298) at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29) at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1610) at com.microsoft.ml.spark.LightGBMBase$class.train(LightGBMBase.scala:51) at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35) at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118) at org.apache.spark.ml.Predictor.fit(Predictor.scala:82) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44) at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:96) at org.apache.spark.ml.Estimator.fit(Estimator.scala:61) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(CrossValidator.scala:152) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) at scala.concurrent.impl.Future$.apply(Future.scala:31) at scala.concurrent.Future$.apply(Future.scala:494) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:160) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:150) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:150) ... 
27 more Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at java.net.Socket.connect(Socket.java:538) at java.net.Socket.<init>(Socket.java:434) at java.net.Socket.<init>(Socket.java:211) at com.microsoft.ml.spark.TrainUtils$.getNetworkInitNodes(TrainUtils.scala:240) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:303) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298) at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29) at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
2019-09-04 17:25:44 INFO BlockManager:54 - Found block rdd_13321_52 remotely 2019-09-04 17:25:44 INFO BlockManager:54 - Found block rdd_13321_196 remotely 2019-09-04 17:25:44 INFO LightGBMRegressor:233 - Successfully bound to port 12472 2019-09-04 17:25:44 INFO LightGBMRegressor:302 - LightGBM worker connecting to host: 10.144.96.175 and port: 38036 2019-09-04 17:25:44 INFO LightGBMRegressor:249 - send empty status to driver 2019-09-04 17:25:44 WARN LightGBMRegressor:307 - LightGBM worker encountered empty partition, for best performance ensure no partitions empty 2019-09-04 17:25:44 INFO Executor:54 - Finished task 58.1 in stage 2542.0 (TID 235941). 8226 bytes result sent to driver 2019-09-04 17:25:44 INFO CoarseGrainedExecutorBackend:54 - Got assigned task 235944 2019-09-04 17:25:44 INFO Executor:54 - Running task 78.0 in stage 2546.0 (TID 235944) 2019-09-04 17:25:44 INFO BlockManager:54 - Found block rdd_13241_72 locally 2019-09-04 17:25:44 INFO LightGBMRegressor:233 - Successfully bound to port 12472 2019-09-04 17:25:45 ERROR Executor:91 - Exception in task 78.0 in stage 2546.0 (TID 235944) java.lang.UnsatisfiedLinkError: /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/tmp/mml-natives8004835678613750376/lib_lightgbm_swig.so: /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/tmp/mml-natives8004835678613750376/lib_lightgbm_swig.so: failed to map segment from shared object at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824) at java.lang.Runtime.load0(Runtime.java:809) at java.lang.System.load(System.java:1086) at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59) at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:39) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298) at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29) at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 262144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /mysqldata/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000011/hs_err_pid14837.log
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2208.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2208.0 (TID 217002, JMNGD1BAE170V10, executor 6): java.lang.UnsatisfiedLinkError: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824) at java.lang.Runtime.load0(Runtime.java:809) at java.lang.System.load(System.java:1086) at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59) at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:38) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298) at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29) at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at 
scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1610) at com.microsoft.ml.spark.LightGBMBase$class.train(LightGBMBase.scala:51) at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35) at com.microsoft.ml.spark.LightGBMRegressor.train(LightGBMRegressor.scala:35) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118) at org.apache.spark.ml.Predictor.fit(Predictor.scala:82) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153) at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44) at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:96) at org.apache.spark.ml.Estimator.fit(Estimator.scala:61) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(CrossValidator.scala:152) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5$$anonfun$apply$1.apply(CrossValidator.scala:151) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) at scala.concurrent.impl.Future$.apply(Future.scala:31) at scala.concurrent.Future$.apply(Future.scala:494) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:160) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$5.apply(CrossValidator.scala:150) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:150) ... 27 more Caused by: java.lang.UnsatisfiedLinkError: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: /data/HadoopData/nm-local-dir/usercache/couture/appcache/application_1567507960273_0008/container_1567507960273_0008_02_000008/tmp/mml-natives4523911706023596553/lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824) at java.lang.Runtime.load0(Runtime.java:809) at java.lang.System.load(System.java:1086) at com.microsoft.ml.spark.NativeLoader.loadLibraryByName(NativeLoader.java:59) at com.microsoft.ml.spark.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:38) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:301) at com.microsoft.ml.spark.TrainUtils$$anonfun$7.apply(TrainUtils.scala:298) at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29) at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:297) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at com.microsoft.ml.spark.LightGBMBase$$anonfun$4.apply(LightGBMBase.scala:49) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
```
@jaipal1995 "lib_lightgbm.so: cannot apply additional memory protection after relocation: Cannot allocate memory at java.lang.ClassLoader$NativeLibrary.load"
it looks like it might be an out-of-memory error. How large is your dataset (number of rows and columns)? How many workers do you have in your cluster, and what are their specifications? What are the driver's specifications (as opposed to the workers')?
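For reference, a quick way to gather those numbers (a minimal sketch, assuming `df` is the training dataframe):

```scala
import org.apache.spark.sql.DataFrame

// Print row count, column count, and partition count for a dataframe.
def describeSize(df: DataFrame): Unit =
  println(s"rows: ${df.count()}, cols: ${df.columns.length}, " +
          s"partitions: ${df.rdd.getNumPartitions}")
```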
"In my scenario, there are several models trained at a time using parArray. Do you have any suggestions on this scenario for optimization?" I've sometimes seen users have issues when training more than one model in parallel. I probably need to investigate more to understand what might be happening there.
Hi @imatiach-msft, I have a dataframe with around 300 groups called Brickids. I have to train a model for each brickid's data (currently training for each brickid sequentially), and a brickid's training data can have anywhere from roughly 100 to 50,000 rows, with around 42 columns. Cluster configuration: the application can use at most 24 executors, each with 4 GB RAM and 1 core, and the driver has 6 GB RAM and 1 core. I am using a grid search with a total of 48 combinations and a CrossValidator with 3 folds, so 48*3 different models have to be trained. This is what my LightGBMRegressor object and pipeline look like:

```scala
val lightGBMRegressor: LightGBMRegressor = new LightGBMRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setObjective("regression")
  .setMinSumHessianInLeaf(10)
  .setVerbosity(3)
  .setBaggingFraction(0.9f)
  .setFeatureFraction(0.8)
  .setEarlyStoppingRound(50)
  .setBaggingSeed(10011993)
  .setNumLeaves(10)
  .setPredictionCol("prediction")
  .setBoostingType("gbdt")
  .setWeightCol("logWeights")

val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(lightGBMRegressor.maxDepth, Array(4, 6, 8))
  .addGrid(lightGBMRegressor.learningRate, Array(0.05, 0.1))
  .addGrid(lightGBMRegressor.numLeaves, Array(5, 10, 20, 40))
  .addGrid(lightGBMRegressor.numIterations, Array(200, 500))
  .build()

val pipeline: Pipeline = new Pipeline().setStages(Array(lightGBMRegressor))

val cv: CrossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(24)
```
Kindly correct me if I am doing anything wrong at any point.
Also, the reduce jobs are taking a lot of time, and only one executor ends up holding the data. I have attached screenshots, in case they are of any help.
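To double-check that skew, something like the following could report the row count per partition (a minimal sketch; the helper name is illustrative):

```scala
import org.apache.spark.sql.DataFrame

// Returns (partitionIndex, rowCount) pairs; if one partition holds
// nearly all rows, the data really is concentrated on one executor.
def rowsPerPartition(df: DataFrame): Array[(Int, Long)] =
  df.rdd
    .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size.toLong)))
    .collect()
```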
Hi @imatiach-msft, any updates on this?
@jaipal1995 sorry about the trouble you are having. Agreed, it looks like there is a bug with doing hyperparameter tuning (multiple jobs in parallel) with lightgbm. I think I might be able to reproduce the issue myself, but if your dataset is public and you could send it to me along with the code that reproduces the exact issue, it would be easier for me to diagnose the problem. If you are on a tight deadline, though, I wouldn't recommend waiting on me; I would try either training the lightgbm models sequentially or using GBTClassifier from SparkML (see the sketch below). Either way, a repro would help me fix the issue, or at least identify the exact problem.
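A minimal sketch of that fallback, using GBTRegressor since this is a regression problem (GBTClassifier, mentioned above, is the classification variant). The parameter values are illustrative, and note that older Spark versions do not support a weight column on GBTs:

```scala
import org.apache.spark.ml.regression.GBTRegressor

val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(6)    // roughly mirrors the lightgbm grid above
  .setMaxIter(200)
  .setStepSize(0.1)
```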
It does seem that all of the data is being placed on one executor, which may explain the out-of-memory errors you are getting. I'm not sure about the cause, however, and I can't quite tell what the issue might be from the logs above.
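If the data really is concentrated on one executor, an explicit repartition before fitting should spread it out (a sketch; the partition count of 24, one per executor core, is illustrative):

```scala
import org.apache.spark.sql.DataFrame

// Redistribute rows evenly across the cluster before training.
def spread(df: DataFrame, numPartitions: Int = 24): DataFrame =
  df.repartition(numPartitions)
```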
@imatiach-msft, where should I share the data and code? If you have an email id, kindly share it.
hi @jaipal1995, you can send the data and code to ilmat at microsoft dot com. Otherwise, you can also just share it on GitHub (e.g. a link to the dataset stored on GitHub or some server location), as long as it is not private.
I have been facing the NullPointerException mentioned above, followed by a 'Connection refused' error. It seems like the connections to the executors are being closed before all the tasks finish. Can you please look into this?
I am running in cluster mode via the spark-submit command with dynamic allocation enabled. Stacktrace