"Frame length should be positive" problem in XGBoost with CPU (Mortgage-large)

peizhaoliu commented 4 years ago

Dear author,

I came across the guide "https://github.com/rapidsai/spark-examples/blob/master/getting-started-guides/on-prem-cluster/standalone-scala.md". When I launch distributed training without GPUs (tree method hist), I use the following parameters:

--num-executors 1 --executor-cores 19 --conf spark.cores.max=19 --conf spark.task.cpus=1 --class ai.rapids.spark.examples.mortgage.CPUMain -numWorkers=19 -treeMethod=hist

However, the tasks of the stage "foreachPartition at XGBoost.scala:703" always stay stuck in the "running" state. A few hours after submitting the job, we got the following error:

java.lang.IllegalArgumentException: Frame length should be positive: -9223371863126827765
    at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
    at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
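
For readers unfamiliar with the "Frame length should be positive" message in the trace above, here is a minimal Python sketch of the kind of check that produces it. It rests on assumptions not stated in this thread: that Spark's TransportFrameDecoder reads an 8-byte big-endian signed length prefix, subtracts the prefix size, and rejects non-positive results. The function names are hypothetical, and this is an illustration, not Spark's actual code. Under those assumptions, a huge negative value suggests the decoder read bytes that were not a well-formed Spark frame.

```python
import struct

# Size of the length prefix in bytes (assumed to match Spark's decoder).
LENGTH_SIZE = 8


def decode_frame_size(header: bytes) -> int:
    """Read a big-endian signed 64-bit length and subtract the prefix size."""
    (raw_length,) = struct.unpack(">q", header[:LENGTH_SIZE])
    return raw_length - LENGTH_SIZE


def check_frame_size(frame_size: int) -> None:
    """Reject non-positive sizes, mirroring the message in the stack trace."""
    if frame_size <= 0:
        raise ValueError(f"Frame length should be positive: {frame_size}")


# A well-formed prefix: 108 bytes in total, so 100 bytes follow the prefix.
valid_header = struct.pack(">q", 108)
check_frame_size(decode_frame_size(valid_header))

# Eight bytes that decode to the value reported in this issue.
foreign_header = struct.pack(">q", -9223371863126827765 + LENGTH_SIZE)
try:
    check_frame_size(decode_frame_size(foreign_header))
except ValueError as error:
    print(error)
```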

Could you please offer some tips on this issue? Thanks.

Sincerely

wjxiz1992 commented 4 years ago

Hi, did you set "nthread" to 1? I ask because "XGBoost4J-Spark requires that all of nthread * numWorkers cores should be available before the training runs." You can add "-nthread=1" directly to the end of your command.
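
The core requirement quoted above can be checked with simple arithmetic. The sketch below is a simplified illustration with hypothetical function names. It assumes the cores available to the job are num-executors times executor-cores, and it ignores other limits such as spark.cores.max or spark.task.cpus.

```python
def required_cores(num_workers: int, nthread: int) -> int:
    """Cores XGBoost4J-Spark needs before training can start."""
    return num_workers * nthread


def has_enough_cores(num_executors: int, executor_cores: int,
                     num_workers: int, nthread: int) -> bool:
    """Compare the requirement against a simplified estimate of available cores."""
    available = num_executors * executor_cores  # assumption: no other limits apply
    return available >= required_cores(num_workers, nthread)


# Settings from the original report (1 executor, 19 cores, 19 workers), nthread=1.
print(has_enough_cores(1, 19, 19, 1))

# The same cluster with nthread=2 would need 38 cores, more than the 19 available.
print(has_enough_cores(1, 19, 19, 2))
```

If the second case applies, the job may wait for resources that never arrive, which would look like tasks stuck in "running".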

You could also try some other parameter set like:

"--num-executors 1 --executor-cores 1 --conf spark.task.cpus=1 -numWorkers=19 -nthread=1 treeMethod=hist".

For the hanging problem, there is a "timeout_request_workers" parameter that may help (but not always). It can reduce the time the application hangs when it cannot get enough resources from Spark. There are also other possible reasons for the program to hang.

To see where it hangs, open Spark's web UI, go to "Executors", and look at the "Thread Dump".

peizhaoliu commented 4 years ago

Hi,

When launching the GPU Mortgage example on Spark standalone, we get the following error:

2020-02-25 15:48:08 ERROR NativeDepsLoader:55 - Could not load cudf jni library...
java.lang.UnsatisfiedLinkError: /tmp/rmm4687696644621164964.so: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:83)
    at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:51)
    at ai.rapids.cudf.Table.<clinit>(Table.java:31)
    at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readToTable(GpuCSVScan.scala:214)
    at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.readBatch(GpuCSVScan.scala:194)
    at ml.dmlc.xgboost4j.scala.spark.rapids.CSVPartitionReader.next(GpuCSVScan.scala:230)

I explored the dependency jars and found that "libxgboost4j.so" and "librmm.so" are inside them. So why can't the cudf JNI library be loaded? Could you please give me some tips for solving this problem?

wjxiz1992 commented 4 years ago

Hi, I guess you are probably using the wrong version of the cudf jar. You should choose the version that matches your CUDA version, e.g. build with "mvn package -Dcuda.classifier=cuda10" if your CUDA version is 10.0. You can check your CUDA version with "cat /usr/local/cuda/version.txt".
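
As a rough illustration of the check described above, the following Python sketch extracts the CUDA version from text in the form that version.txt typically contains (for example "CUDA Version 10.0.130") and looks up a build classifier. The function names and the lookup table are hypothetical. The table holds only the 10.0 to cuda10 pairing mentioned in this thread, so any other version returns None and should be looked up in the project's build guide.

```python
import re
from typing import Optional

# Only the pairing mentioned in this thread; other versions are deliberately omitted.
KNOWN_CLASSIFIERS = {"10.0": "cuda10"}


def parse_cuda_version(version_text: str) -> Optional[str]:
    """Return 'major.minor' from text such as 'CUDA Version 10.0.130'."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def classifier_for(version_text: str) -> Optional[str]:
    """Look up the classifier for the detected version, or None if unknown."""
    version = parse_cuda_version(version_text)
    return KNOWN_CLASSIFIERS.get(version)


# Example with the kind of line found in /usr/local/cuda/version.txt.
print(classifier_for("CUDA Version 10.0.130"))
```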

peizhaoliu commented 4 years ago

Thanks for your tip! Following the earlier suggestion, we used the '-nthread' parameter in XGBoost4J-Spark without GPUs. The results show that it works: '-nthread' helps with optimization. However, when launching distributed training with GPUs, changing "-nthread=1" to "-nthread=6" does not seem to have any effect. The full parameter set is "--num-executors 1 --executor-cores 6 --conf spark.task.cpus=6 -numWorkers=1 -nthread=6 treeMethod=gpu_hist". What causes this?

Sincerely

rapidsai / spark-examples

"Frame length should be positive" problem in XGBoost with CPU (Mortgage-large) #71