Closed leizhanggit closed 5 years ago
Can you please provide more specific information about the Databricks setup where you got the error message? What runtime, what type of worker, what exactly is in your init script, and what is the path to your init script in the cluster advanced settings?
GpuDataset has a very limited API right now, mostly just read and then use with XGBoost. If you need to do other ETL processing you will have to use normal Spark for that. We are working on getting more ETL operators implemented, but that will be through a different plugin and using Spark 3.0. You can find GpuDataset in the XGBoost code: https://github.com/rapidsai/xgboost/blob/rapids-spark/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/rapids/GpuDataset.scala
Note, I tested this out this morning to make sure Databricks didn't make any changes, and it did work for me. Here is the configuration I was using:
Cluster Mode: Standard
Runtime: 5.4 ML (includes Apache Spark 2.4.3, GPU, Scala 2.11)
Python Version: 3
Autoscale: disabled
Worker Type: p3.2xlarge, 1 worker
Driver Type: p3.2xlarge
Advanced Options: default except for:
add ssh keys
Init Scripts: dbfs:/databricks/scripts/init.sh
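If it helps to compare settings exactly, the configuration above can also be expressed as a Databricks clusters API payload. This is a hedged sketch, not from the original instructions: the field names follow the Databricks Clusters API 2.0, the `spark_version` string and cluster name are my guesses, so verify them against your workspace (e.g. via `databricks clusters spark-versions`) before using it.

```shell
# Write the cluster spec matching the settings listed above; you could
# then create the cluster with the Databricks CLI:
#   databricks clusters create --json-file cluster.json
cat > cluster.json <<'EOF'
{
  "cluster_name": "xgboost-gpu-demo",
  "spark_version": "5.4.x-gpu-ml-scala2.11",
  "node_type_id": "p3.2xlarge",
  "driver_node_type_id": "p3.2xlarge",
  "num_workers": 1,
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/databricks/scripts/init.sh" } }
  ]
}
EOF
```

Setting a fixed `num_workers` (instead of an `autoscale` block) corresponds to disabling autoscale in the UI.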
Downloaded the 3 jars and put them into dbfs as the instructions specified. Keep track of the paths dbfs assigns, as you need them for the init.sh script.
My init.sh contents are below. You will have to update this script to match the file names you uploaded. MAKE SURE THIS IS CORRECT, as a mismatch here could cause your error. To be clear, in the script below update the file names
/7194b940_46ba_41ef_8a41_e8e71130484b-xgboost4j_0_90_1_Beta-6dea8.jar, /6587ff10_a44f_4590_8828_473d6d72e350-cudf_0_8-aaf1d.jar, and /d541fac4_424f_459c_86d5_166d1047c664-xgboost4j_spark_0_90_1_Beta-bcdc8.jar
to match what you uploaded.
```shell
sudo cp /dbfs/FileStore/jars/7194b940_46ba_41ef_8a41_e8e71130484b-xgboost4j_0_90_1_Beta-6dea8.jar /databricks/jars/spark--maven-trees--ml--xgboost--ml.dmlc--xgboost4j--ml.dmlcxgboost4j0.81.jar
sudo cp /dbfs/FileStore/jars/6587ff10_a44f_4590_8828_473d6d72e350-cudf_0_8-aaf1d.jar /databricks/jars/
sudo cp /dbfs/FileStore/jars/d541fac4_424f_459c_86d5_166d1047c664-xgboost4j_spark_0_90_1_Beta-bcdc8.jar /databricks/jars/spark--maven-trees--ml--xgboost--ml.dmlc--xgboost4j-spark--ml.dmlcxgboost4j-spark0.81.jar
```
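Since a mistyped jar name was the cause of the error in this thread, one option (my own sketch, not part of the original instructions) is to add a small guard at the top of init.sh so a wrong path fails fast with a clear message instead of silently leaving the cluster without the libraries:

```shell
#!/bin/bash
# Hedged sketch: fail loudly if a jar referenced below does not exist.
# The path in the example call is the one from this thread; substitute
# the file names dbfs assigned to your own uploads.
require_jar() {
  if [ ! -f "$1" ]; then
    echo "init.sh: missing jar: $1" >&2
    exit 1
  fi
}
# Example usage (uncomment and adjust before each sudo cp line):
# require_jar /dbfs/FileStore/jars/7194b940_46ba_41ef_8a41_e8e71130484b-xgboost4j_0_90_1_Beta-6dea8.jar
```

With this in place, a bad file name shows up in the cluster's init script logs immediately rather than as an `UnsatisfiedLinkError` at runtime.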
Used the dbfs CLI to upload it: `dbfs cp init.sh dbfs:/databricks/scripts/init.sh`
Start the cluster, import the notebook, and run.
Thank you so much! The problem was in my init.sh script. Now the code works!
Great, glad to hear, I will close the issue then.
Hi all,
I am trying the demo on Databricks: https://github.com/rapidsai/spark-examples/blob/master/docs/databricks.md I followed every step, and then an error popped up when Spark read the CSV and trained the model.
The error message is:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.7, executor 0): java.lang.UnsatisfiedLinkError: ai.rapids.cudf.Table.gdfReadCSV([Ljava/lang/String;[Ljava/lang/String;[Ljava/lang/String;Ljava/lang/String;JJIBBB[Ljava/lang/String;[Ljava/lang/String;[Ljava/lang/String;)[J
```
I am also trying to find the APIs of GpuDataset, but I cannot find any. I cannot run the following code, as no `count` function exists. BTW, what is the relation between GpuDataset and cuDF?