Closed lappemic closed 4 months ago
Cool ! Thanks for diving into it :)
Your implementation is great and indeed supports shuffling and batching, you just need to additionally account for state_dict (for dataset checkpointing+resuming)
That being said, I believe the implementation can be made simpler by relying on IterableDataset.map()
which already implements all this. Maybe something like
def batch(self, batch_size: int, drop_last_batch: bool = False) -> "IterableDataset":
def batch(unbatched: dict[str, list]) -> dict[str, list]:
return {k: [v] for k, v in unbatched}
return self.map(batch, batched=True, batch_size=batch_size, drop_last_batch=drop_last_batch)
And this way no need to reimplement everything !
(my only small concern is that it's not an Arrow-optimized function so it requires the examples to be manipulated as python objects even if the original data is in Arrow format (e.g. when streaming Parquet files) but it's not a big deal and we can see later if we need to optimize this)
Thanks a lot for the feedback @lhoestq! I definitely could have saved some time looking into it properly first. 😅
Implemented the .batch()
method, added a proper docsrtring for documentation, and added tests.
Let me know what you think and if this needs some update.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the feedbak @lhoestq!
Applied it and referenced the batched=True
option in the map
function and highlighted the difference. Hope i got this right.
I've taken a try at implementing a batched
IterableDataset
as requested in issue #6279. This PR adds a newBatchedExamplesIterable
class and a.batch()
method to theIterableDataset
class.The main changes are:
BatchedExamplesIterable
that groups examples into batches..batch()
method forIterableDataset
to easily create batched versions.I'm not sure if this is exactly what you had in mind and also have not fully tested it atm, so I'd really appreciate your feedback. Does this seem like it's heading in the right direction? I'm happy to make any changes or explore different approaches if needed.
Pinging @lhoestq