sunjianzhou opened this issue 5 years ago
I hit the same problem. I produce a sparse DataFrame that is later used with MLlib, and exporting it as ArrayType is difficult. Converting a very sparse DataFrame with many (100k and up) columns back to a dense representation in Spark is very inefficient. Also, Spark MLlib is designed to work with sparse vectors, which are later difficult to export. What could be the solution?
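For context on why that round trip is expensive: a Spark SparseVector stores only a (size, indices, values) triple, while an exported ArrayType column must materialize every slot. A minimal pure-Python sketch of that expansion (the function name is mine, not a Spark API; it mirrors what SparseVector.toArray() does):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list.

    Every one of `size` slots is materialized, which is why exporting a
    100k-column sparse DataFrame as dense arrays blows up in time and space.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A vector with 100k slots but only 2 non-zeros still costs 100k floats:
row = sparse_to_dense(100_000, [3, 99_999], [1.5, -2.0])
```

So even if only a handful of entries are non-zero, the dense export pays for the full width on every row.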
My Spark DataFrame looks like this:

+--------------------+---------+
|             feature|    label|
+--------------------+---------+
|[-5395.3890376257...|[0.0,1.0]|
|[6.69571816328211...|[1.0,0.0]|
|[-2870.5446747200...|[0.0,1.0]|
|[4240.09470794739...|[1.0,0.0]|
|[-9969.1310950791...|[0.0,0.0]|
|[494.875486401857...|[0.0,0.0]|
+--------------------+---------+

The label column is stored as a DenseVector, since I don't know whether a SparseVector can work in TensorFlow.
Then I get an error when I execute:

data_frame.repartition(num_partition).write.format("tfrecords").mode("overwrite").save(save_path)
Here is the error message:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 73.0 failed 1 times, most recent failure: Lost task 162.0 in stage 73.0 (TID 1392, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
Caused by: java.lang.RuntimeException: Cannot convert field to unsupported data type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce
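The "unsupported data type ... VectorUDT" line suggests the tfrecords writer has no serializer for Spark ML vector columns, so a common workaround is to convert them to ArrayType(DoubleType()) before saving, e.g. with a UDF (Spark 3+ also has pyspark.ml.functions.vector_to_array). The conversion such a UDF performs is just flattening the vector into a plain list of floats; a minimal sketch of that body (function name is mine; the Spark wiring in the trailing comments is an assumption about your session):

```python
def vector_to_floats(v):
    """Flatten a Spark ML vector-like value into a plain list of floats.

    Handles anything exposing .toArray() (DenseVector / SparseVector both do)
    as well as values that are already plain sequences.
    """
    if hasattr(v, "toArray"):
        v = v.toArray()
    return [float(x) for x in v]

# Hypothetical Spark wiring, applied before the .write.format("tfrecords") call:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, DoubleType
#   to_array = udf(vector_to_floats, ArrayType(DoubleType()))
#   data_frame = (data_frame
#                 .withColumn("feature", to_array("feature"))
#                 .withColumn("label", to_array("label")))
```

After the columns are ArrayType, the writer no longer sees VectorUDT and the error above should not be triggered.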
Please help.