utkarshgupta137 opened this issue 2 years ago
I am curious why batching cannot be done on the user side; I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?
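For example, batching at read time is usually a one-liner with tf.data. A minimal sketch, assuming a TF 2.x setup; the file pattern, feature spec, and batch size are placeholders:

```python
import tensorflow as tf

# Placeholder feature spec and file pattern -- adjust to your schema and paths.
feature_spec = {"feature1": tf.io.FixedLenFeature([], tf.int64)}
files = tf.data.Dataset.list_files("/path/to/tfrecord_out/part-*")

dataset = (
    tf.data.TFRecordDataset(files)                       # one Example proto per record
    .batch(1024)                                         # batching done by tf.data
    .map(lambda x: tf.io.parse_example(x, feature_spec)) # parse a whole batch at once
    .prefetch(tf.data.AUTOTUNE)
)
```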
The difference in file size between, say, 1000 individual Example records and a single SequenceExample holding those 1000 rows is substantial (the unbatched data is ~50% larger in my case). As a result, the files take longer to read and write, and the memory/disk requirements go up.
Which Spark operation does batching correspond to? GroupBy? Spark-TFRecord is implemented as a Spark data source (similar to Avro, Parquet, CSV), so it supports most data source options, and I don't see batching in Spark's data source API. TFRecordReader supports batching; why is that not an option for you?
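For context, usage through the data source API looks roughly like this. A sketch only, assuming the spark-tfrecord jar is on the classpath; the toy DataFrame and output path are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "feature1")  # toy DataFrame

# Write as TFRecord Example protos; "recordType" can also be "SequenceExample".
df.write.format("tfrecord").option("recordType", "Example").save("/tmp/tfrecord_out")

# Read back through the same data source.
imported = (
    spark.read.format("tfrecord")
    .option("recordType", "Example")
    .load("/tmp/tfrecord_out")
)
```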
Batching can be implemented by adding an index to all the rows and then assigning a batch to each row, e.g. `batch = index // batch_size` (integer division), so that each batch covers `batch_size` consecutive rows.
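A rough PySpark sketch of that idea (not the library's API; the input path, column names, and batch size are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/input")  # placeholder input
batch_size = 1000                          # illustrative batch size

# Attach a consecutive global index, then derive a batch id so that each
# batch id covers `batch_size` consecutive rows.
indexed = (
    df.rdd.zipWithIndex()
    .map(lambda pair: (pair[1] // batch_size, pair[0]))
    .toDF(["batch_id", "row"])
)

# Collect the rows of each batch; each resulting group would then be
# serialized as one SequenceExample.
batched = indexed.groupBy("batch_id").agg(F.collect_list("row").alias("rows"))
```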
Yes, TFRecordReader supports batching, but the whole point of doing it in Spark is the file-size and I/O benefit mentioned in my previous comment.
It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter. Contributions are welcome.
It would be great if this library could automatically create batches and save them as SequenceExamples. I tried to create the batches myself, but I ran into memory issues when doing so. I think if it were handled properly at the partition level, it would be both faster and easier to use.
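A rough sketch of what partition-level batching could look like (this is not part of the library; the chunking helper, input path, and batch size are hypothetical):

```python
from itertools import islice
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/input")  # placeholder input

def chunk_rows(rows, batch_size=1000):
    """Yield lists of up to `batch_size` consecutive rows from one partition."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

# Each partition is chunked independently: no shuffle, and only one batch
# per partition is materialized at a time. Each chunk could then be
# serialized as a single SequenceExample.
batched_rdd = df.rdd.mapPartitions(chunk_rows)
```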