Steps to Reproduce:
1). Upload the XGBoost package jar files to a big data Aris cluster. You can download them from here:
https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.72/xgboost4j-0.72.jar
https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.72/xgboost4j-spark-0.72.jar
Upload the jar files to the 'jar' directory on the cluster.
2). Upload the Spark Python files for XGBoost to the '/user/root' folder on the cluster. You can download them from here:
https://github.com/dmlc/xgboost/files/2161553/sparkxgb.zip
3). Open a new 'pyspark' notebook and paste the following code into the first set of cells:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar pyspark-shell'
spark = SparkSession\
    .builder\
    .appName("PySpark XGBOOST Titanic")\
    .master("local[*]")\
    .getOrCreate()
4). Run this code. Note that this part runs fine.
5). Add another cell with the following line:
spark.sparkContext.addPyFile("/user/root/sparkxgb.zip")
6). Run this cell. You will get the following error message:
An error occurred while calling o106.addFile.
: java.io.FileNotFoundException: File file:/user/root/sparkxgb.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:641)
The file exists at this location on the cluster, but it is not found.
The code is taken from here: https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb
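
One detail in the error message may be worth checking: the path is reported as file:/user/root/sparkxgb.zip, and the stack trace goes through RawLocalFileSystem, which suggests the path was looked up on the local filesystem rather than on the cluster's HDFS. The sketch below only inspects the URI scheme and builds an explicitly qualified path. The helper names scheme_of and with_hdfs_scheme are hypothetical, and the assumption that the zip lives in HDFS (so that an hdfs:// prefix is the right one) has not been verified on this cluster.

```python
from urllib.parse import urlparse

# Path copied verbatim from the FileNotFoundException in step 6.
REPORTED_PATH = "file:/user/root/sparkxgb.zip"


def scheme_of(path: str) -> str:
    """Return the URI scheme of a path, or '' when the path has none."""
    return urlparse(path).scheme


def with_hdfs_scheme(path: str) -> str:
    """Prefix a scheme-less absolute path with 'hdfs://' (hypothetical helper).

    Assumption: the file was uploaded to HDFS. Paths that already carry
    a scheme are returned unchanged.
    """
    if scheme_of(path):
        return path
    return "hdfs://" + path


# The reported path carries the local-filesystem scheme.
print(scheme_of(REPORTED_PATH))
# The path from step 5, qualified so it is not resolved locally.
print(with_hdfs_scheme("/user/root/sparkxgb.zip"))
```

In the notebook, step 5 would then read spark.sparkContext.addPyFile(with_hdfs_scheme("/user/root/sparkxgb.zip")). Whether this makes the error go away here is untested; it is only a way to rule out local-versus-HDFS path resolution as the cause.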