googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

AnalysisException: u'Path does not exist #2136

Open sofgun opened 5 years ago

sofgun commented 5 years ago

Hi. I have created a Google Datalab notebook on a Dataproc cluster. I've managed to link the notebook to my dataset like this:

```python
!rm german_credit_data_biased_training.csv
!wget https://raw.githubusercontent.com/emartensibm/german-credit/master/german_credit_data_biased_training.csv

from pyspark.sql import SparkSession
import pandas as pd
import json

pd_data = pd.read_csv("german_credit_data_biased_training.csv", sep=",", header=0)
```

The above works fine, but when I run this one:

```python
df_data = sc.read.csv(path='german_credit_data_biased_training.csv', sep=',', header=True, inferSchema=True)
```

I get this error:

```
AnalysisExceptionTraceback (most recent call last)
 in ()
      3 #df_data = sc.read.csv.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("german_credit_data_biased_training.csv"));
      4 #df_data = sc.read.format("csv").option("header", "true").load("german_credit_data_biased_training.csv")
----> 5 df_data = sc.read.csv(path='german_credit_data_biased_training.csv', sep=',', header=True, inferSchema=True)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping)
    439             path = [path]
    440         if type(path) == list:
--> 441             return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    442         elif isinstance(path, RDD):
    443             def func(iterator):

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
     67                 e.java_exception.getStackTrace()))
     68         if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69             raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70         if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71             raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u'Path does not exist: hdfs://mycluster-m/user/root/german_credit_data_biased_training.csv;'
```

I've tried several variations of the expression above but always get the same error. Does anyone know how to connect my dataset so that Spark can find it at this path?