rapidsai / spark-examples

[ARCHIVED] Moved to github.com/NVIDIA/spark-xgboost-examples
https://github.com/NVIDIA/spark-xgboost-examples
Apache License 2.0

XGBoost training failed on 14 GB CSV file #34

Closed leizhanggit closed 5 years ago

leizhanggit commented 5 years ago

Hello,

I am trying to train a model on a 14 GB CSV data file. I made only a slight change to the Spark demo code. (I have no problem running the demo code with the mortgage data.)

Here is the error:

------ Training ------ Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.139.64.33, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1} ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

and the debug message:

    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:795)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributedForGpuDataset$1.apply(XGBoost.scala:609)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributedForGpuDataset$1.apply(XGBoost.scala:591)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:296)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributedForGpuDataset(XGBoost.scala:590)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:258)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-1436081708071208:14)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-1436081708071208:14)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw$$iw$$iw$Benchmark$.time(command-1436081708071208:4)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1436081708071208:13)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-1436081708071208:73)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw$$iw.<init>(command-1436081708071208:75)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw$$iw.<init>(command-1436081708071208:77)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw$$iw.<init>(command-1436081708071208:79)
    at linee6f0a941471544b887f97374bba146ad102.$read$$iw.<init>(command-1436081708071208:81)
    at linee6f0a941471544b887f97374bba146ad102.$read.<init>(command-1436081708071208:83)
    at linee6f0a941471544b887f97374bba146ad102.$read$.<init>(command-1436081708071208:87)
    at linee6f0a941471544b887f97374bba146ad102.$read$.<clinit>(command-1436081708071208)
    at linee6f0a941471544b887f97374bba146ad102.$eval$.$print$lzycompute(<notebook>:7)
    at linee6f0a941471544b887f97374bba146ad102.$eval$.$print(<notebook>:6)
    at linee6f0a941471544b887f97374bba146ad102.$eval.$print(<notebook>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
    at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
    at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:653)
    at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:606)
    at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197)
    at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:342)
    at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:319)
    at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:47)
    at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:47)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:319)
    at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
    at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
    at scala.util.Try$.apply(Try.scala:192)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
    at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
    at java.lang.Thread.run(Thread.java:748)

Here is the cluster configuration: Standard_NC6s_v3 (beta) driver, Standard_NC6s_v3 (beta) workers, Databricks Runtime 5.4 ML (includes Apache Spark 2.4).

I simply modified the mortgage data example. How can I obtain more detailed debug information?

tgravescs commented 5 years ago

I take it this is on Azure. It looks like the NC6s_v3 uses the Tesla V100, which should be fine. I'm not sure how much memory is on there; I'm guessing around 16 GB. If you have 14 GB of CSV data with only one worker, the data and the results won't all fit in memory, so that could be your issue. Either way, you should look at the UI and logs to see what is going on.

On Databricks, look at the Spark UI, either from the notebook or from the cluster view, and see if tasks have failed. If you see failed tasks, look at their logs via the links on the UI. If you don't see any, look at the driver logs from the cluster page.

tgravescs commented 5 years ago

Actually, was your CSV a single file or multiple files? This first release of XGBoost for Spark doesn't support splitting files, so you would need to split that file yourself into smaller files so that you get multiple partitions. We are adding support for splitting in the next release.
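As an aside, here is a minimal sketch of one way to do that split with Spark (the paths and partition count below are placeholders, not from this thread):

    // Sketch: rewrite a single large CSV as several smaller part files so the
    // GPU reader ends up with multiple partitions (one or more per worker).
    // "/path/to/..." and the partition count are illustrative placeholders.
    val df = spark.read.option("header", "true").csv("/path/to/big_input.csv")
    df.repartition(8)
      .write
      .option("header", "true")
      .csv("/path/to/split_output")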

leizhanggit commented 5 years ago

Thanks! I think it is related to file splitting. It seems that a single big CSV file works, while multiple smaller CSV files fail.

I did two tests:

// splitting spark dataframe
val Array(my_spark_df1, my_spark_df2) = tempDF.randomSplit(Array(0.3, 0.7))
val data_path = "...my_data.csv"

Now, if I save the CSV as multiple small files, it reports an error during the training stage:

my_spark_df1.write.option("header", "true").csv(data_path)

But if I save the CSV as one big file, the training stage completes successfully:

my_spark_df1.repartition(1).write.option("header", "true").csv(data_path)

leizhanggit commented 5 years ago

Update: I still fail when using a 7 GB single-file CSV. A 4 GB or smaller single-file CSV is OK. It is still unknown why the job fails.

19/08/01 22:44:50 ERROR Uncaught throwable from user code: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

wjxiz1992 commented 5 years ago

When your job is submitted to Spark, you could run "watch -n 0.1 nvidia-smi" to monitor the GPU memory usage. I thought a 3 GB training set should be fine, but 4 GB may fail.

CSV file splitting will be supported in the next release.


anfeng commented 5 years ago

Likely, you should use multiple workers (executors) so that your dataset can fit in GPU memory.
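For illustration only (the parameter map and column name below are assumptions, not taken from the example code), the number of XGBoost workers is set on the classifier, so spreading training across two GPUs could look roughly like this:

    // Sketch: request two XGBoost workers so training is spread across two
    // executors/GPUs. The parameter map and "label" column are placeholders.
    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    val xgbClassifier = new XGBoostClassifier(Map(
        "tree_method" -> "gpu_hist",
        "objective"   -> "binary:logistic"
      ))
      .setLabelCol("label")
      .setNumWorkers(2) // roughly one worker per available GPU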

leizhanggit commented 5 years ago

How many workers do I need? Is there a guideline on the ratio between data size and number of workers?


tgravescs commented 5 years ago

It's odd that it failed with smaller files; that should normally have worked. Do you have 1 task per GPU configured? I.e., make sure Spark isn't putting multiple tasks on the same executor, all trying to use the same GPU. Each executor should only run 1 task at a time and be allocated 1 GPU.
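One common way to get that (these are standard Spark properties, not settings taken from this thread; on Databricks they would normally go in the cluster's Spark config) is to give each executor a single core:

    // Sketch: with one core per executor, each executor runs at most one task
    // at a time, so tasks can't contend for the same GPU. Values are
    // illustrative; set them via spark-submit or the cluster config if the
    // session is already running.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.executor.cores", "1")
      .config("spark.task.cpus", "1")
      .getOrCreate()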

Did you look at the Spark UI or driver logs?

leizhanggit commented 5 years ago

After adding more workers, I think I am able to run training with bigger data.

So here are the tests I tried:

| training data size | workers | result |
| --- | --- | --- |
| 800 MB | 1 | pass |
| 1.5 GB | 1 | pass |
| 4.0 GB | 1 | pass |
| 4.8 GB | 1 | failed |
| 4.8 GB | 2 | pass |

I would assume the maximum training data size per worker (GPU) is somewhere between 4 GB and 4.8 GB.

Most of the code is borrowed from the mortgage example. All of the configuration is the same as in the mortgage example.

anfeng commented 5 years ago

The issue is resolved per @leizhanggit's latest comment.