springml / spark-sftp

Spark connector for SFTP
Apache License 2.0

NullPointerException when reading file from sftp #65

Open sslavian812 opened 5 years ago

sslavian812 commented 5 years ago

I'm trying to read a CSV file from an SFTP server and convert it to a dataframe. The file is at /ppreports/outgoing/MY.CSV. I can see it when logging in with a GUI client.

val df = spark.read
            .format("com.springml.spark.sftp")
            .option("host", HOST)
            .option("username", USER)
            .option("password", PASSWORD)
            .option("fileType", "csv")
            .option("inferSchema", "false")
            .option("createDF", "false")
            .load("/ppreports/outgoing/MY.CSV")

I get

java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:453)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:291)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:277)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:212)

If I try to read a non-existing file:

val df = spark.read
            .format("com.springml.spark.sftp")
            .option("host", HOST)
            .option("username", USER)
            .option("password", PASSWORD)
            .option("fileType", "csv")
            .option("inferSchema", "false")
            .option("createDF", "false")
            .load("/ppreports/outgoing/non-existing.CSV")

Then I predictably get a file-not-found error:

2: No such file or directory
    at com.jcraft.jsch.ChannelSftp.throwStatusError(ChannelSftp.java:2833)
    at com.jcraft.jsch.ChannelSftp._stat(ChannelSftp.java:2185)
    at com.jcraft.jsch.ChannelSftp._stat(ChannelSftp.java:2202)
    at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:914)
    at com.jcraft.jsch.ChannelSftp.get(ChannelSftp.java:874)
    at com.springml.sftp.client.SFTPClient.copyInternal(SFTPClient.java:168)
    at com.springml.sftp.client.SFTPClient.copy(SFTPClient.java:74)
    at com.springml.spark.sftp.DefaultSource.copy(DefaultSource.scala:212)
    at com.springml.spark.sftp.DefaultSource.createRelation(DefaultSource.scala:80)
    at com.springml.spark.sftp.DefaultSource.createRelation(DefaultSource.scala:41)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:291)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:277)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:212)

Thus, I conclude that the file is there and spark-sftp finds it, but fails to download it. What should I do?

samuel-pt commented 5 years ago

@sslavian812 - What content is present in the file? Can you check whether it is a valid CSV file?

Also, try the latest spark-sftp connector, as we fixed a similar issue there.

sslavian812 commented 5 years ago

Hi @samuel-pt, thank you for the answer. I'm still struggling with the NPE while reading the CSV file.

> whether it is valid CSV file?

It's a text file, a regular CSV. I can download it with curl and open it on my local machine.

> latest spark-sftp

I upgraded from 1.3 to com.springml:spark-sftp_2.11:1.1.5; it didn't help.

It seems I'll have to implement something custom: say, download the CSV with apache-commons-vfs, upload it to S3, and then read it into a dataframe using the standard API.

AJAnujsharma commented 5 years ago

Yes, this is an issue; I'm facing it too: java.lang.NullPointerException when reading an existing file. Even upgrading to com.springml:spark-sftp_2.11:1.1.5 didn't help.

Let me know if any other option can be implemented

vejeta commented 5 years ago

Can you provide a sample of the file to be tested?

AJAnujsharma commented 5 years ago

You can use any file, either csv or txt. The connector tries to perform two things at the same time:

  1. Copy the file from SFTP to a temp location in DBFS
  2. Read the file from DBFS

That's why it fails, which is a bug.

As a workaround, use a try/catch block: in the try, read the file (this copies the data to DBFS but fails to create the dataframe); in the catch, read it again, which loads the already-copied file into a dataframe.

This is a temporary solution, but the underlying behavior is a bug.
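The workaround described above amounts to retrying the load: the first attempt copies the file to the temp location as a side effect and then throws, and the second attempt succeeds against the copy. A minimal, library-agnostic retry helper (the function name and structure are mine, not part of spark-sftp):

```python
def load_with_retry(load, attempts=2):
    """Call `load` up to `attempts` times, returning the first success.

    `load` is any zero-argument callable, e.g.
        lambda: spark.read.format("com.springml.spark.sftp")....load(path)
    The first call may fail after the side effect of copying the file;
    a later call then reads the already-copied data.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return load()
        except Exception as exc:
            last_error = exc
    raise last_error
```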

yuvapraveen commented 4 years ago

@AJAnujsharma Can you please provide the code snippet that you used to sftp from databricks. Not sure I get what you are doing in your catch block. Thanks in advance.

sauerch91 commented 4 years ago

> @AJAnujsharma Can you please provide the code snippet that you used to sftp from databricks. Not sure I get what you are doing in your catch block. Thanks in advance.

That works for me! (Using the public example SFTP server test.rebex.net; the try/except is the workaround.)

    try:
        df = (spark
              .read
              .format("com.springml.spark.sftp")
              .option("host", "test.rebex.net")
              .option("username", sftp_user)
              .option("password", sftp_password)
              .option("fileType", "txt")
              .option("tempLocation", "/dbfs/tmp/")
              .load("/pub/example/readme.txt"))
    except:
        df = (spark
              .read
              .format("com.springml.spark.sftp")
              .option("host", "test.rebex.net")
              .option("username", sftp_user)
              .option("password", sftp_password)
              .option("fileType", "txt")
              .option("tempLocation", "/tmp/")
              .load("/pub/example/readme.txt"))

yuvapraveen commented 4 years ago

@sauerch91 were you able to write to an SFTP server? If so, can you give me the snippet please? It seems the library cannot read from the temporary DBFS location.

DataBach-maker commented 2 years ago

I am with @yuvapraveen. Does someone have a working example? I'm struggling to write to an SFTP server and get an NPE with the newest version, 1.0.3.
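For reference on the write question: spark-sftp's README documents a write path symmetric to the read path, via `df.write.format("com.springml.spark.sftp")...save(path)`. A hedged sketch; the option keys below are assumed to match the read calls earlier in this thread, and whether the write path accepts all of them is not verified here.

```python
def sftp_options(host, user, password, file_type="csv"):
    # Same option keys as the read calls in this thread; applying them
    # unchanged to the write path is an assumption.
    return {"host": host, "username": user,
            "password": password, "fileType": file_type}

# Assumed usage (untested; requires Spark, the connector, and a server):
# (df.write
#    .format("com.springml.spark.sftp")
#    .options(**sftp_options(HOST, USER, PASSWORD))
#    .save("/ppreports/outgoing/out.csv"))
```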