springml / spark-sftp

Spark connector for SFTP
Apache License 2.0

Downloading to Tmp in local directory and reading from hdfs #24

Open prachiDev opened 6 years ago

prachiDev commented 6 years ago

The SFTP file is downloaded to the /tmp folder on my local system, but it is then read from an HDFS location that does not exist. Specifying the tmp directory did not help here.

Here are the logs

18/05/31 07:58:30 INFO client.SFTPClient: Copying files from /test_data/production_test.json to /tmp/production_test.json
18/05/31 07:58:31 INFO client.SFTPClient: Copied files successfully...
18/05/31 07:58:31 INFO json.JSONRelation: Listing hdfs://nameservice1/tmp/production_test.json on driver
Exception in thread "main" java.io.FileNotFoundException: File hdfs://nameservice1/tmp/production_test.json does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:735)
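The log above shows the mismatch: the connector copies the file to the driver's local /tmp, but the bare path /tmp/production_test.json is then resolved against fs.defaultFS, so Spark goes looking for it on HDFS. A minimal Python sketch of that resolution rule (the `resolve_path` helper and the hard-coded nameservice1 default are illustrative only, not part of Hadoop's API):

```python
from urllib.parse import urlparse

def resolve_path(path, default_fs="hdfs://nameservice1"):
    """Roughly mimic how Hadoop resolves a path: if it carries no
    scheme, it is interpreted relative to fs.defaultFS."""
    if urlparse(path).scheme:
        return path  # explicit scheme (hdfs://, file://, ...) wins
    return default_fs + path

print(resolve_path("/tmp/production_test.json"))
# hdfs://nameservice1/tmp/production_test.json  <- what the log shows
print(resolve_path("file:///tmp/production_test.json"))
# file:///tmp/production_test.json              <- stays local
```

This is why the download "succeeds" yet the read fails: both sides are using the same string, but only one of them is looking at the local filesystem.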

samuel-pt commented 6 years ago

@prachiDev - Please send us the full stack trace.

Also please copy the code that you are using

jctobin commented 6 years ago

I have the same error using an older version; code below:

# start pyspark with packages that work within the environment I have to use
pyspark --packages com.springml:spark-sftp_2.10:1.0.2

df = (sqlContext.read.format("com.springml.spark.sftp")
    .option("host", myftpsite)
    .option("username", myuser)
    .option("password", mypassword)
    .option("fileType", "csv")
    .option("header", "true")
    .load("/path/to/ftp/test.csv"))

...
18/06/21 19:10:43 INFO metastore: Connected to metastore.
18/06/21 19:10:44 INFO SessionState: Created local directory: /tmp/ed21ad9b-4513-439d-984f-80c8e4a507de_resources
18/06/21 19:10:44 INFO SessionState: Created HDFS directory: /tmp/hive/USER/ed21ad9b-4513-439d-984f-80c8e4a507de
18/06/21 19:10:44 INFO SessionState: Created local directory: /tmp/USER/ed21ad9b-4513-439d-984f-80c8e4a507de
18/06/21 19:10:44 INFO SessionState: Created HDFS directory: /tmp/hive/USER/ed21ad9b-4513-439d-984f-80c8e4a507de/_tmp_space.db
18/06/21 19:10:44 INFO DefaultSource: Copying /path/to/ftp/test.csv to /tmp/test.csv
18/06/21 19:10:45 INFO SFTPClient: Copying files from /path/to/ftp/test.csv to /tmp/test.csv
18/06/21 19:10:45 INFO SFTPClient: Copied files successfully...
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/2.5.3.0-37/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/path/to/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/path/to/2.5.3.0-37/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/path/to/site-packages/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o52.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://NAMESERVICE/tmp/test.csv

However, if I run this on the server:

cat /tmp/test.csv
col1,col2
'a',1
'b',2

but checking HDFS:

hdfs dfs -cat /tmp/test.csv
cat: `/tmp/test.csv': No such file or directory

So it looks like the file isn't being copied to HDFS. Perhaps this has been corrected in a more recent version, but for my work I am only able to use Spark 1.x.

My current workaround is to run the initial read.format("com.springml.spark.sftp"), wait for it to fail, and then run df = sqlContext.read.format("csv").option("header", "true").load("/tmp/test.csv").

It feels super ugly, but at least I have instant results.
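A slightly less brittle version of this fallback, assuming the connector really has downloaded the file to the driver-local /tmp as the logs indicate, is to give the path an explicit file:// scheme so it is not resolved against fs.defaultFS. A sketch (`to_local_uri` is a hypothetical helper, not part of the connector):

```python
def to_local_uri(local_path):
    # An explicit file:// scheme makes Hadoop read the driver-local
    # copy instead of resolving the bare path against fs.defaultFS.
    return "file://" + local_path

# e.g. df = sqlContext.read.format("csv").option("header", "true") \
#              .load(to_local_uri("/tmp/test.csv"))
print(to_local_uri("/tmp/test.csv"))  # file:///tmp/test.csv
```

Note this only works in local or client mode where the driver is the machine that did the SFTP download; on a cluster the executors would each look at their own local filesystems.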

viveknair89 commented 6 years ago

Using springml version 1.1.2 solved this issue for me.

sbt dependency: "com.springml" % "spark-sftp_2.11" % "1.1.2"

rgtv commented 5 years ago

I also encountered similar problems.

spark.read.
  format("com.springml.spark.sftp").
  option("host", "**********").
  option("username", "**********").
  option("password", "*********").
  option("fileType", "txt").
  load("test.txt")

The error is:

18/10/08 10:28:29,792 INFO Driver SFTPClient: Copying file from /pcspuser/ycj/check.txt to /data03/yarn/usercache/aps/appcache/application_1537433932246_8153913/container_1537433932246_8153913_01_000001/tmp/check.txt
18/10/08 10:28:29,817 INFO Driver SFTPClient: Copied files successfully...
18/10/08 10:28:29,963 ERROR Driver ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: /data03/yarn/usercache/aps/appcache/application_1537433932246_8153913/container_1537433932246_8153913_01_000001/tmp (is not a directory)

I use the latest version. My Spark version is spark-2.1.0.9.