tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

Cannot convert field to unsupported data type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce #120

Open sunjianzhou opened 5 years ago

sunjianzhou commented 5 years ago

My Spark DataFrame is as below:

+--------------------+---------+
|             feature|    label|
+--------------------+---------+
|[-5395.3890376257...|[0.0,1.0]|
|[6.69571816328211...|[1.0,0.0]|
|[-2870.5446747200...|[0.0,1.0]|
|[4240.09470794739...|[1.0,0.0]|
|[-9969.1310950791...|[0.0,0.0]|
|[494.875486401857...|[0.0,0.0]|
+--------------------+---------+

The label column holds DenseVectors, since I don't know whether sparse vectors work in TensorFlow.
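For what it's worth, both DenseVector and SparseVector columns are stored under the same VectorUDT, so switching between them would not change the type the tfrecords writer sees. A quick schema check makes that visible:

```python
# Both dense and sparse MLlib vector columns print as "vector" (VectorUDT)
data_frame.printSchema()
# root
#  |-- feature: vector (nullable = true)
#  |-- label: vector (nullable = true)
```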

I then get an error when I execute:

data_frame.repartition(num_partition).write.format("tfrecords").mode("overwrite").save(save_path)

Here is the error message:

Caused by: java.lang.RuntimeException: Cannot convert field to unsupported data type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce

org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 73.0 failed 1 times, most recent failure: Lost task 162.0 in stage 73.0 (TID 1392, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
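A likely workaround, sketched rather than confirmed: the error says VectorUDT itself is the unsupported type, so converting the vector columns to plain ArrayType(DoubleType()) columns before writing should sidestep it. This assumes data_frame, num_partition, and save_path from above:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Turn an MLlib Vector (dense or sparse) into a plain list of doubles;
# arrays of numeric types are something the tfrecords writer can handle.
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

converted = (data_frame
             .withColumn("feature", to_array("feature"))
             .withColumn("label", to_array("label")))

converted.repartition(num_partition) \
    .write.format("tfrecords").mode("overwrite").save(save_path)
```

On Spark 3.0+, pyspark.ml.functions.vector_to_array does the same conversion without a Python UDF.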

Please help.

mantelllo commented 3 years ago

Same problem here. I produce a sparse DataFrame that is later used with MLlib, and there is a difficulty in exporting it as ArrayType. Converting it back to a very sparse DataFrame with many (100k and up) columns in Spark is very inefficient. Spark MLlib is also designed to work with sparse DataFrames, which are then difficult to export. What could be the solution?
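One possible direction, sketched under assumptions rather than tested: keep the sparsity by splitting each SparseVector into parallel indices/values array columns before writing, then rebuild a tf.sparse.SparseTensor on the TensorFlow side. The split_sparse helper and the column names below are hypothetical:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import (StructType, StructField,
                               ArrayType, IntegerType, DoubleType)

# Hypothetical helper: represent a SparseVector as two flat arrays,
# which avoids both VectorUDT and a 100k-wide dense layout.
sparse_schema = StructType([
    StructField("indices", ArrayType(IntegerType())),
    StructField("values", ArrayType(DoubleType())),
])
split_sparse = udf(
    lambda v: ([int(i) for i in v.indices], [float(x) for x in v.values]),
    sparse_schema,
)

flat = (df.withColumn("sparse", split_sparse("feature"))
          .select("sparse.indices", "sparse.values", "label"))
flat.write.format("tfrecords").mode("overwrite").save(save_path)
```

On the TensorFlow side, the indices and values features, together with the known vector size, are enough to construct a tf.sparse.SparseTensor per example.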