Closed leizhanggit closed 5 years ago
I take it this is on Azure. It looks like the NC6s_v3 uses the Tesla V100, which should be fine. I'm not sure how much memory is on there; I'm guessing around 16 GB. If you have 14 GB of CSV data with only one worker, the data and the results won't all fit in memory, so that could be your issue. But you should look at the UI and logs to see what is going on.
On Databricks, look at the Spark UI, either from the notebook or from the cluster view, and see whether any tasks have failed. If you see failed tasks, look at their logs via the links in the UI. If you don't, just look at the driver logs from the cluster page.
Actually, was your CSV a single file or multiple files? This first release of XGBoost for Spark doesn't support splitting files, so you would need to split that file yourself into smaller files to get multiple partitions. We are adding support for splitting in the next release.
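One way to do that split by hand is sketched below. This is only an illustration: "big.csv", the chunk size, and the output names are placeholders, not anything from this issue. The key detail is re-attaching the header to every chunk so each part file is a valid standalone CSV.

```shell
# Sketch: split a large CSV into smaller valid CSVs, keeping the header
# on every chunk so Spark can read each part file independently.
# "big.csv" and the 1,000,000-line chunk size are placeholders.
head -n 1 big.csv > header.csv                   # save the header row
tail -n +2 big.csv | split -l 1000000 - body_    # split the body only
for f in body_*; do
  cat header.csv "$f" > "part_${f#body_}.csv"    # re-attach the header
  rm "$f"
done
```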
Thanks! I think it is related to file splitting. It seems that a single big CSV file works, while multiple smaller CSV files fail.
I did two tests:
// splitting spark dataframe
val Array(my_spark_df1, my_spark_df2) = tempDF.randomSplit(Array(0.3, 0.7))
val data_path = "...my_data.csv"
Now, if I save the CSV as multiple small files, it reports an error during the training stage:
my_spark_df1.write.option("header", "true").csv(data_path)
but if I save the CSV into one big file, the training stage completes successfully:
my_spark_df1.repartition(1).write.option("header", "true").csv(data_path)
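As a middle ground between the two writes above, the partition count can also be chosen explicitly. A sketch only: "numWorkers" is a placeholder for the number of GPU workers in the cluster, not a value from this issue.

```scala
// Sketch: write exactly numWorkers part files, one per GPU worker.
// numWorkers is a placeholder; match it to your cluster size.
val numWorkers = 2
my_spark_df1
  .repartition(numWorkers)
  .write
  .option("header", "true")
  .csv(data_path)
```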
Update: I still fail when using a 7 GB single-file CSV. A 4 GB or smaller single-file CSV is OK. It is still unknown why the job failed.
19/08/01 22:44:50 ERROR Uncaught throwable from user code: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
When your job is submitted to Spark, you can run "watch -n 0.1 nvidia-smi" to monitor GPU memory usage. I would expect a 3 GB training set to be fine, but 4 GB may fail.
CSV file splitting will be supported in the next release.
Likely, you should have multiple workers (executors) so that your dataset can fit in GPU memory.
How many workers do I need? Is there a guideline about the ratio between the size of the data and the number of workers?
It's odd that it failed with smaller files; that should normally have worked. Do you have one task per GPU configured? I.e., make sure Spark isn't putting multiple tasks on the same executor, all trying to use the same GPU. One executor should only run one task at a time and be allocated one GPU.
Did you look at the Spark UI or driver logs?
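A common way to guarantee one task per executor (and therefore per GPU) is to set spark.task.cpus equal to spark.executor.cores, so no executor can schedule a second concurrent task. A sketch with placeholder values, not the exact conf from this issue:

```shell
# Sketch: example Spark conf (all values are placeholders) that forces one
# concurrent task per executor, so each task gets exclusive use of its GPU.
spark-submit \
  --conf spark.executor.cores=4 \
  --conf spark.task.cpus=4 \
  --conf spark.executor.memory=20g \
  --class ... your-app.jar
```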
After adding more workers, I am able to run through with bigger training data.
So here are the tests I tried:
> training_data_size   workers   result
> 800 M                1         pass
> 1.5 G                1         pass
> 4.0 G                1         pass
> 4.8 G                1         failed
> 4.8 G                2         pass
I would assume the biggest training data per worker (GPU) is somewhere between 4 GB and 4.8 GB.
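Going by the table above, a rough rule of thumb would be about one worker per 4 GB of CSV input. This is an assumption drawn only from these runs, not an official guideline, and the helper below is hypothetical:

```scala
// Sketch: rough worker-count estimate based on the runs in this thread.
// The 4 GB-per-worker limit is empirical, not an official figure.
val perWorkerGB = 4.0
def workersNeeded(dataGB: Double): Int =
  math.ceil(dataGB / perWorkerGB).toInt

// For the 14 GB dataset in this issue: workersNeeded(14.0) == 4
```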
Most of the code is borrowed from the mortgage example. All of the configuration is the same as in the mortgage example.
The issue is resolved per @leizhanggit's latest comment
Hello,
I am trying to train a model with a 14 GB CSV data file. I just made a slight change to the Spark demo code. (I have no problem running the demo code with the mortgage data.)
Here is the error:
and the debug message:
Here is the configuration of the cluster:
Standard_NC6s_v3 (beta) (driver)
Standard_NC6s_v3 (beta) (worker)
Runtime: 5.4 ML (includes Apache Spark 2.4)
I simply modified the example to use data other than the mortgage data. How can I obtain more detailed debug information?