prachiDev opened this issue 6 years ago
@prachiDev - Please send us the full stack trace.
Also, please copy the code that you are using.
I have the same error using an older version; code below:

# start pyspark with packages that work within the environment I have to use
pyspark --packages com.springml:spark-sftp_2.10:1.0.2

df = (sqlContext.read.format("com.springml.spark.sftp")
      .option("host", myftpsite)
      .option("username", myuser)
      .option("password", mypassword)
      .option("filetype", "csv")
      .option("header", "true")
      .load("/path/to/ftp/test.csv"))

(Note: the chained .option calls need to be wrapped in parentheses or end with backslashes, otherwise Python raises a SyntaxError before Spark ever runs.)
...
18/06/21 19:10:43 INFO metastore: Connected to metastore.
18/06/21 19:10:44 INFO SessionState: Created local directory: /tmp/ed21ad9b-4513-439d-984f-80c8e4a507de_resources
18/06/21 19:10:44 INFO SessionState: Created HDFS directory: /tmp/hive/USER/ed21ad9b-4513-439d-984f-80c8e4a507de
18/06/21 19:10:44 INFO SessionState: Created local directory: /tmp/USER/ed21ad9b-4513-439d-984f-80c8e4a507de
18/06/21 19:10:44 INFO SessionState: Created HDFS directory: /tmp/hive/USER/ed21ad9b-4513-439d-984f-80c8e4a507de/_tmp_space.db
18/06/21 19:10:44 INFO DefaultSource: Copying /path/to/ftp/test.csv to /tmp/test.csv
18/06/21 19:10:45 INFO SFTPClient: Copying files from /path/to/ftp/test.csv to /tmp/test.csv
18/06/21 19:10:45 INFO SFTPClient: Copied files successfully...
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/2.5.3.0-37/spark/python/pyspark/sql/readwriter.py", line 137, in load
return self._df(self._jreader.load(path))
File "/path/to/site-packages/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/path/to/2.5.3.0-37/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/path/to/site-packages/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o52.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://NAMESERVICE/tmp/test.csv
However, if I run on the server:
cat /tmp/test.csv
col1,col2
'a',1
'b',2
but checking HDFS:
hdfs dfs -cat /tmp/test.csv
cat: `/tmp/test.csv': No such file or directory
So it looks like the file isn't being copied to HDFS. Perhaps this has been corrected in a more recent version, but for my work I am only able to use Spark 1.x.
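The mismatch above makes sense once you look at how Hadoop qualifies paths: a path with no URI scheme is resolved against fs.defaultFS, so the connector writes to the driver's local /tmp while the subsequent read lists hdfs://NAMESERVICE/tmp. A rough pure-Python illustration of that qualification rule (resolve_path is a hypothetical helper sketching the behaviour, not Hadoop code):

```python
from urllib.parse import urlparse

# Illustrative only: mimic how Hadoop qualifies a path that has no
# URI scheme against fs.defaultFS (here the cluster's nameservice).
DEFAULT_FS = "hdfs://NAMESERVICE"

def resolve_path(path, default_fs=DEFAULT_FS):
    # A full URI (file://, hdfs://, ...) is left alone; a bare path
    # is qualified against the default filesystem.
    if urlparse(path).scheme:
        return path
    return default_fs + path

print(resolve_path("/tmp/test.csv"))         # qualified against HDFS
print(resolve_path("file:///tmp/test.csv"))  # stays local
```

On a cluster where fs.defaultFS points at HDFS, prefixing the locally copied file with file:// is one way to make the path unambiguous.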
My current work-around is to run the initial read.format("com.springml.spark.sftp") load, wait for it to fail, then run df = sqlContext.read.format("csv").option("header", "true").load("/tmp/test.csv"). It feels super ugly, but at least I get instant results.
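That two-step work-around can at least be wrapped so the fallback happens automatically. This is a hedged sketch: load_with_fallback is a hypothetical helper (not part of the connector), and the commented lambdas stand in for the real sqlContext.read calls from above:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; if it raises (e.g. the Py4JJavaError
    above when the HDFS path is missing), run the fallback loader
    against the file the connector already copied to local /tmp."""
    try:
        return primary()
    except Exception:
        return fallback()

# In a Spark session the callables would look something like:
# df = load_with_fallback(
#     lambda: sqlContext.read.format("com.springml.spark.sftp")
#                 .option("host", myftpsite)
#                 .load("/path/to/ftp/test.csv"),
#     lambda: sqlContext.read.format("csv")
#                 .option("header", "true")
#                 .load("/tmp/test.csv"))
```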
Using springml version 1.1.2 solved this issue for me. sbt dependency: "com.springml" % "spark-sftp_2.11" % "1.1.2"
I also encountered a similar problem.
spark.read
  .format("com.springml.spark.sftp")
  .option("host", "**********")
  .option("username", "**********")
  .option("password", "*********")
  .option("fileType", "txt")
  .load("test.txt")
The error is:
18/10/08 10:28:29,792 INFO Driver SFTPClient: Copying file from /pcspuser/ycj/check.txt to /data03/yarn/usercache/aps/appcache/application_1537433932246_8153913/container_1537433932246_8153913_01_000001/tmp/check.txt
18/10/08 10:28:29,817 INFO Driver SFTPClient: Copied files successfully...
18/10/08 10:28:29,963 ERROR Driver ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: /data03/yarn/usercache/aps/appcache/application_1537433932246_8153913/container_1537433932246_8153913_01_000001/tmp (is not a directory)
I use the latest version. My Spark version is spark-2.1.0.9.
The SFTP file is getting downloaded to the /tmp folder on my local system, but it is being read from an HDFS location that does not exist. Specifying a tmp directory did not help here.
Here are the logs:
18/05/31 07:58:30 INFO client.SFTPClient: Copying files from /test_data/production_test.json to /tmp/production_test.json
18/05/31 07:58:31 INFO client.SFTPClient: Copied files successfully...
18/05/31 07:58:31 INFO json.JSONRelation: Listing hdfs://nameservice1/tmp/production_test.json on driver
Exception in thread "main" java.io.FileNotFoundException: File hdfs://nameservice1/tmp/production_test.json does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:735)
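The step missing in both reports is that nothing moves the driver-local copy to the HDFS location the reader then lists. A minimal sketch of that staging step, with a plain local directory standing in for hdfs://nameservice1/tmp (stage_for_reader is a hypothetical helper; on a real cluster you would use hdfs dfs -put or the Hadoop FileSystem API instead of shutil):

```python
import os
import shutil

def stage_for_reader(local_copy, reader_dir):
    """Copy the file the connector downloaded to local /tmp into the
    directory the downstream reader will actually list, and return
    the staged path. A local directory stands in for HDFS here."""
    os.makedirs(reader_dir, exist_ok=True)
    return shutil.copy(local_copy, reader_dir)
```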