linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License

[Feature Request] Add option to batch data when using SequenceExample #51

Open utkarshgupta137 opened 2 years ago

utkarshgupta137 commented 2 years ago

It would be great if this library could automatically create batches and save them as SequenceExample records. I tried to batch the data myself, but ran into memory issues when doing so. If batching were handled properly at the partition level, it would be both faster and easier to use.

junshi15 commented 2 years ago

I am curious why batching cannot be done on the user side; I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?
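To illustrate the reader-side batching being suggested here: below is a minimal stdlib sketch of what a batching reader does. The `batched` helper is hypothetical (not part of any library in this thread); in real TensorFlow code the equivalent is `tf.data.TFRecordDataset(paths).batch(batch_size)`, which groups records into batches as they are read.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive fixed-size batches from a row stream.

    This mirrors reader-side batching a la
    tf.data.TFRecordDataset(paths).batch(batch_size): records arrive
    one at a time and are grouped into batches at read time.
    The final batch may be shorter than batch_size.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# The integers stand in for parsed Example records.
print(list(batched(range(10), 4)))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note this happens after the files are written, which is why it does not address the file-size concern raised below.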

utkarshgupta137 commented 2 years ago

The difference in file size between, say, 1000 individual Example records and a single SequenceExample containing 1000 rows is large (the unbatched data is ~50% larger in my case). The unbatched form therefore takes longer to read and write and increases memory and disk-space requirements.
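One source of this overhead is that every serialized Example repeats its feature-key strings, whereas a SequenceExample stores each key once per batch. The numbers below are illustrative assumptions (not from this issue), just to show the shape of the arithmetic:

```python
# Illustrative only: rough size of repeated feature keys per file.
# Assumed numbers (not from the issue): 1000 rows, 10 features per row,
# average feature-key length of 12 bytes. Real protobuf framing adds
# further per-record overhead not counted here.
rows, features, key_len = 1000, 10, 12

# 1000 Example protos: every record serializes all feature keys again.
unbatched_key_bytes = rows * features * key_len

# One SequenceExample: each key appears once in its feature_lists map.
batched_key_bytes = features * key_len

print(unbatched_key_bytes, batched_key_bytes)
# → 120000 120
```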

junshi15 commented 2 years ago

Which Spark operation does batching correspond to? GroupBy? Spark-TFRecord is implemented as a Spark data source (similar to Avro, Parquet, or CSV), so it supports most data-source options, but I don't see batching anywhere in Spark's data source API. TFRecordReader supports batching; why is that not an option for you?

utkarshgupta137 commented 2 years ago

Batching can be implemented by adding an index to all the rows and then assigning a batch to each row with batch = index % batch_size. Yes, TFRecordReader supports batching, but the whole point of doing it in Spark is explained in my previous comment.
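A minimal stdlib sketch of this index-based scheme (in Spark this would be a `row_number()` window function, a derived batch-id column, and a `groupBy` on it; plain Python stands in here). One caveat: `index % batch_size` round-robins rows into `batch_size` groups, so the sketch uses `index // batch_size` instead, which puts `batch_size` consecutive rows into each group and is presumably the intent.

```python
from collections import defaultdict
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def assign_batches(rows: Iterable[T], batch_size: int) -> List[List[T]]:
    """Group rows into consecutive batches of up to batch_size rows.

    In Spark: add an index with row_number(), derive a batch-id column,
    then groupBy the batch id; each group could be written out as one
    SequenceExample.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # index // batch_size keeps consecutive rows together; the
        # index % batch_size variant would round-robin them instead.
        groups[index // batch_size].append(row)
    return [groups[k] for k in sorted(groups)]

print(assign_batches(list("abcdefg"), 3))
# → [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Doing this per partition (rather than with a global index) avoids a full shuffle, at the cost of one possibly short batch per partition.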

junshi15 commented 2 years ago

It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter. Contributions are welcome.