microsoft / azuredatastudio

Azure Data Studio is a data management and development tool with connectivity to popular cloud and on-premises databases. Azure Data Studio supports Windows, macOS, and Linux, with immediate capability to connect to Azure SQL and SQL Server. Browse the extension library for more database support options including MySQL, PostgreSQL, and MongoDB.
https://learn.microsoft.com/sql/azure-data-studio
MIT License

spark.sparkContext.addPyFile() doesn't find file in ADS when using pyspark kernel #6784

Closed sfweller closed 1 year ago

sfweller commented 4 years ago

Steps to Reproduce:

1. Upload the XGBoost package jar files to a big data Aris cluster. You can download them from here:
    https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.72/xgboost4j-0.72.jar
    https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.72/xgboost4j-spark-0.72.jar

    Upload the jar files to the 'jar' directory on the cluster.

2. Upload the Spark Python files for XGBoost to the '/user/root' cluster folder. You can download them from here:
    https://github.com/dmlc/xgboost/files/2161553/sparkxgb.zip

3. Open a new 'pyspark' notebook and paste the following code into the first set of cells:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar pyspark-shell'

    spark = SparkSession\
        .builder\
        .appName("PySpark XGBOOST Titanic")\
        .master("local[*]")\
        .getOrCreate()

4. Run this code. This part runs without errors.

5. Add another cell with the following line:

    spark.sparkContext.addPyFile("/user/root/sparkxgb.zip")

6. Run this cell. You will get the following error message:

    An error occurred while calling o106.addFile.
    : java.io.FileNotFoundException: File file:/user/root/sparkxgb.zip does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:641)

The file exists at this location, but it is not found.
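
The `file:` prefix and the `RawLocalFileSystem` frame in the error suggest that the scheme-less path was looked up on the local filesystem of the machine running the driver, not in the cluster's storage. The sketch below only illustrates that interpretation in plain Python, so it runs without a cluster. `effective_uri` is a hypothetical helper, not a Spark or Hadoop API, and the `hdfs:///` form is an assumption about how the uploaded file might be addressed. Whether passing such a URI resolves the error in this setup is untested.

```python
from urllib.parse import urlparse

def effective_uri(path: str) -> str:
    """Hypothetical helper: mimic how a path without a scheme ends up as a local file: URI."""
    if urlparse(path).scheme:
        # An explicit scheme (for example hdfs:) is kept as written.
        return path
    # No scheme: treated as a local file, matching the 'file:' prefix in the error.
    return "file:" + path

print(effective_uri("/user/root/sparkxgb.zip"))         # file:/user/root/sparkxgb.zip
print(effective_uri("hdfs:///user/root/sparkxgb.zip"))  # hdfs:///user/root/sparkxgb.zip
```

The first result matches the path reported in the exception, which is consistent with the lookup never reaching the cluster folder from step 2.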

The code is taken from here: https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb
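
For reference, the `PYSPARK_SUBMIT_ARGS` value from step 3 is a comma-separated `--jars` list followed by `pyspark-shell`. A minimal sketch of assembling it from a list, assuming the jar names are resolvable from the working directory as in the original snippet; `submit_args` is a hypothetical helper, not part of PySpark:

```python
import os

def submit_args(jars):
    """Hypothetical helper: build a PYSPARK_SUBMIT_ARGS string from a list of jar names."""
    # --jars takes a single comma-separated value; 'pyspark-shell' comes last.
    return "--jars " + ",".join(jars) + " pyspark-shell"

jars = ["xgboost4j-spark-0.72.jar", "xgboost4j-0.72.jar"]
# Set the variable before any SparkSession is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(jars)
```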

Charles-Gagnon commented 1 year ago

BDC support in ADS is being deprecated so closing all related issues.