Open legatoo opened 5 years ago
Hi there,
Not sure why you need `.mode(SaveMode.Overwrite)` when writing the dataframe to HDFS. Have you tried it without `.mode(SaveMode.Overwrite)`?
@MajesticKhan I didn't think it would help, but I gave it a try, and the same problem occurs. According to the doc, `mode` "specifies the behavior when data or table already exists". When I use `mode` without `partitionBy` to produce tfrecords, it works.
The only other idea I have is to first partition the data and then persist or cache the dataframe, and from there try to write it to the stated path.
At the very least, `partitionBy` should fail instead of being silently ignored. Currently it's a silent bug.
Encountering the same problem.
Any update on this from the developers? I see the same issue on 1.15.0. Does anyone have a workaround? I suppose I could partition the data and then save the partitions sequentially, but I'm wondering if there are better approaches.
I've run into this issue as well. Using `partitionBy` with all the suggested workarounds still results in one partition. Is anyone working on this issue?
+1 @boarder7395
I suppose the tensorflow datasource takes the whole dataframe as an input and doesn't respect the `partitionBy` clause.
Is there a workaround other than writing each partition's data separately in a loop?
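For anyone who ends up doing the loop workaround: the idea is to take each partition's rows and write them one at a time into a Hive-style `key=value` subdirectory, which is the same layout `partitionBy` would have produced. Here is a minimal sketch of that layout logic in plain Python (no Spark; the function name and file naming are my own, just to illustrate the pattern):

```python
import os
import tempfile
from collections import defaultdict

def write_partitioned(rows, key, out_dir):
    """Group rows by `key` and write each group under a key=value/
    subdirectory, mimicking the layout partitionBy would produce."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, group in groups.items():
        part_dir = os.path.join(out_dir, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        # In the real workaround this write would be the tfrecord
        # writer called on the filtered dataframe; plain text here.
        with open(os.path.join(part_dir, "part-00000"), "w") as f:
            for row in group:
                f.write(repr(row) + "\n")
    return sorted(groups)

rows = [{"label": i % 2, "x": i} for i in range(6)]
out = tempfile.mkdtemp()
print(write_partitioned(rows, "label", out))  # [0, 1]
print(sorted(os.listdir(out)))                # ['label=0', 'label=1']
```

In Spark terms, the loop body would be `df.filter(col(key) == value).write...` into the corresponding subdirectory, one distinct key value per iteration.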
We open-sourced a similar package to address this issue. You can try it out here: https://github.com/linkedin/spark-tfrecord https://engineering.linkedin.com/blog/2020/spark-tfrecord
I want to split my data evenly, so I add a column `index` to my dataframe, and I am pretty sure this column is added correctly; I printed some rows to check. I first add the index using the code below, then I want to partition by `index_id`, but this code outputs only one partition every time. I thought it could be something wrong with the dataframe, but when I output to another format, e.g. csv, it works as expected.
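The even-split idea above (the original code blocks didn't survive, so this is a hypothetical reconstruction of the pattern, not the poster's code) can be illustrated outside Spark: assign each row `index_id = row_number % n_partitions`, which distributes rows round-robin so every partition gets a nearly equal count.

```python
from collections import Counter

def add_index_id(rows, n_partitions):
    """Round-robin partition index: row i goes to partition i % n."""
    return [dict(row, index_id=i % n_partitions) for i, row in enumerate(rows)]

rows = [{"x": i} for i in range(10)]
indexed = add_index_id(rows, n_partitions=3)
counts = Counter(r["index_id"] for r in indexed)
print(dict(counts))  # {0: 4, 1: 3, 2: 3}
```

In Spark this would typically be done with `monotonically_increasing_id` (or `row_number` over a window) followed by a modulo, and then `partitionBy("index_id")` on write, which is exactly the step this issue reports as broken for the tfrecord datasource.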