springml / spark-sftp

Spark connector for SFTP
Apache License 2.0
Underlying mechanism #3

Closed aNebula closed 7 years ago

aNebula commented 7 years ago

Would you please explain in a bit more detail what you mean by "SFTP files are fetched and written using jsch. It is not executed as spark job"? Could this cause issues in a cluster?

samuel-pt commented 7 years ago

@aNebula jsch is a Java implementation of SSH2. It does not provide a way to fetch a file from an SFTP server in parallel, so the fetch runs in a single thread.

Because the fetch is single-threaded, files from the SFTP server are fetched by a single worker, and the other workers in the cluster are not used. Without parallelism, this package will take some time to fetch the files from the SFTP server. Apart from this, I don't foresee any issues here.

If performance is not a concern for you, this caveat should not bother you.

aNebula commented 7 years ago

@samuel-pt thanks for the quick reply. Fetching with a single worker sounds fine. However, I'm hoping to fetch multiple files using multiple workers and load them into dataframes. Can you think of a way I could do that with your library?

samuel-pt commented 7 years ago

@aNebula You can point this package at a single file to download. So you can create multiple jobs, one per file, each using this package.
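
A minimal sketch of the one-job-per-file idea, in Python for illustration. It is not taken from the package itself: `load_files_concurrently` and `make_sftp_loader` are hypothetical helper names, and the format string and option names (`host`, `username`, `password`, `fileType`) are assumptions based on the package README that should be checked against the version you use. Each individual fetch is still single-threaded. Only the per-file jobs are submitted concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def make_sftp_loader(spark, host, username, password, file_type="csv"):
    """Build a function that loads ONE remote file into a DataFrame.

    Assumption: the data source name and option keys below match the
    spark-sftp README; verify them for your version of the package.
    """
    def load_one(path):
        return (spark.read.format("com.springml.spark.sftp")
                .option("host", host)
                .option("username", username)
                .option("password", password)
                .option("fileType", file_type)
                .load(path))
    return load_one


def load_files_concurrently(paths, load_one, max_workers=4):
    """Run load_one(path) for every path, one job per file, from a thread pool.

    Returns a dict mapping each path to whatever load_one returned for it.
    """
    paths = list(paths)
    if not paths:
        return {}
    # Never start more threads than there are files to fetch.
    with ThreadPoolExecutor(max_workers=min(max_workers, len(paths))) as pool:
        results = list(pool.map(load_one, paths))  # map preserves input order
    return dict(zip(paths, results))
```

Usage would look roughly like `frames = load_files_concurrently(["/in/a.csv", "/in/b.csv"], make_sftp_loader(spark, host, user, password))`, where `spark` is an existing `SparkSession` and the paths and credentials are placeholders. Whether this actually spreads the download across workers depends on how the package performs the fetch, so treat it as a way to overlap the per-file jobs, not as a guarantee of cluster-wide parallelism.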

springml commented 7 years ago

@aNebula I hope @samuel-pt's answer helped you. Please re-open if anything is unclear.