springml / spark-sftp

Spark connector for SFTP
Apache License 2.0

DataFrame created has incorrect schema #9

Closed vaibhavpals closed 7 years ago

vaibhavpals commented 7 years ago

I want to use a custom schema when creating the DataFrame. On executing the code below:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", IntegerType, true)))

val df = sqlContext.read.
        format("com.springml.spark.sftp").
        option("host", "localhost").
        option("username", "root").
        option("password", "****").
        option("fileType", "csv").
        option("inferSchema", "false").
        option("header", "false").
        schema(customSchema).
        load("/home/files/data_people.csv")

df.printSchema()

df.show()

The output I get is as follows:

root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)

+-------+----------+---+
|     C0|        C1| C2|
+-------+----------+---+
|John   |     last | 24|
|Jack   |     last | 25|
+-------+----------+---+

Clearly the provided schema is not being applied. Any suggestions on how to get this working?

mittalakhilesh commented 7 years ago

I am having the same issue. Is there a workaround for this?

samuel-pt commented 7 years ago

@mittalakhilesh This is a bug and I am working on it. Currently there is no workaround for this issue. I'll update this ticket once I push the fix.

vaibhavpals commented 7 years ago

@samuel-pt Please take a look at the pull request I submitted to fix this issue.


mittalakhilesh commented 7 years ago

For now I am using a workaround of converting the DataFrame to an RDD and then back to a DataFrame with the schema.

val dfRdd = df.rdd
val newDf = sparkSession.createDataFrame(dfRdd, schema)
newDf

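For reference, a fuller sketch of the same workaround, assuming the customSchema and sqlContext from the original report. The columns read back from the CSV are all strings, so the age column is converted explicitly before the schema is re-applied; passing the raw df.rdd straight into createDataFrame would keep string values and can fail at evaluation time for the IntegerType column.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// customSchema as defined in the original report
val customSchema = StructType(Array(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", IntegerType, true)))

// Convert each inferred row to the target types (the sample output shows padded
// values, hence the trim) before re-applying the schema.
val typedRdd = df.rdd.map(r =>
  Row(r.getString(0).trim, r.getString(1).trim, r.getString(2).trim.toInt))
val newDf = sqlContext.createDataFrame(typedRdd, customSchema)

newDf.printSchema()
newDf.show()
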
springml commented 7 years ago

This issue is resolved by https://github.com/springml/spark-sftp/commit/55a6764e77b767d64835ed7c1ac32438d7023398