springml / spark-sftp

Spark connector for SFTP
Apache License 2.0

Why is df.coalesce(1) necessary? #57

Open sunayansaikia opened 5 years ago

sunayansaikia commented 5 years ago

Hey folks,

I wanted to understand why 'df.coalesce(1)' is called when writing the DataFrame to DFS. Please refer to the code here: https://github.com/springml/spark-sftp/blob/master/src/main/scala/com/springml/spark/sftp/DefaultSource.scala#L249

Thanks

samuel-pt commented 5 years ago

@sunayansaikia - That was done so that a single file is written to SFTP.
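
As a rough illustration of the answer above (not code from spark-sftp): Spark writes one part file per partition, so coalescing to a single partition leaves a single file to upload. The sketch below assumes a local SparkSession and a temporary output directory; the object name `CoalesceDemo` and the helper `partFileCount` are hypothetical.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceDemo {
  // Writes df as CSV to a fresh temp directory and counts the part files produced.
  def partFileCount(df: DataFrame): Int = {
    val out = Files.createTempDirectory("coalesce-demo").resolve("out").toString
    df.write.csv(out)
    new File(out).listFiles().count(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-demo")
      .getOrCreate()
    try {
      // Several partitions: a plain write emits one part file per non-empty partition.
      val df = spark.range(0, 1000).toDF("id").repartition(4)
      println(s"part files without coalesce: ${partFileCount(df)}")
      // A single partition: the write emits a single part file.
      println(s"part files with coalesce(1): ${partFileCount(df.coalesce(1))}")
    } finally {
      spark.stop()
    }
  }
}
```

The trade-off is that `coalesce(1)` funnels all rows through one task, which may be slow for large DataFrames.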

sunayansaikia commented 5 years ago

Hey @samuel-pt, is this a hard requirement? Can we not download multiple files and upload them via SFTP? Could this be made configurable?

samuel-pt commented 5 years ago

@sunayansaikia - It's not a hard requirement. We can just add a configurable parameter and use it. The following are needed:

  1. Actual Code changes
  2. Tests for the new changes
  3. README update accordingly

sunayansaikia commented 5 years ago

OK, cool. I will check out the code and take a look.

shaikmanu797 commented 5 years ago

@sunayansaikia, the PR below should make the coalesce numPartitions value configurable: https://github.com/springml/spark-sftp/pull/68
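
For readers following along, the configurable parameter discussed in this thread could look roughly like the sketch below. It is an assumption for illustration only, not the contents of the linked PR: the option key `numPartitions`, the object `CoalesceOption`, and the helper `resolveNumPartitions` are hypothetical names. It defaults to 1 so that the existing single-file behaviour would be preserved.

```scala
import scala.util.Try

object CoalesceOption {
  // Hypothetical option key; not taken from the spark-sftp source or the PR.
  val OptionKey = "numPartitions"
  // Default of 1 keeps the current behaviour: one output file.
  val DefaultPartitions = 1

  // Reads the partition count from the data source options, validating it.
  def resolveNumPartitions(parameters: Map[String, String]): Int =
    parameters.get(OptionKey) match {
      case None => DefaultPartitions
      case Some(raw) =>
        Try(raw.trim.toInt).toOption.filter(_ > 0).getOrElse(
          throw new IllegalArgumentException(
            s"$OptionKey must be a positive integer, got '$raw'"))
    }
}
```

A write path could then call something like `df.coalesce(CoalesceOption.resolveNumPartitions(parameters))`. With a value above 1, the connector would presumably also need to upload every resulting part file rather than just one.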

sunayansaikia commented 5 years ago

@shaikmanu797: Great!