Closed lappemic closed 4 months ago
Looks good to me! :) You might want to add the `map` `num_proc` argument as well, for people who want to make it run faster.
Thanks for the feedback @lhoestq! The last commits include:
- the `num_proc` parameter to `batch`
- documentation next to `IterableDataset.batch()` on the Stream page. But I could not find a better place atm. Where would you put this documentation? WDYT?
You can put the documentation in process.mdx :)
I reset the head to the commit before I added the `Dataset.batch()` documentation to `stream.mdx` and instead added the documentation to `process.mdx`.
This PR introduces a new `batch` method to the `Dataset` class, aligning its functionality with the `IterableDataset.batch()` method (implemented in #7054). The implementation also uses the existing `map` method for efficient batching of examples.

Key changes:
- `batch` method added to the `Dataset` class in `arrow_dataset.py`
- existing `map` method used for batching

Closes #7063
Once the approach is approved, I will create the tests and update the documentation.
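For readers following along, the idea of batching through a batched `map` can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: `map_batched` is a hypothetical stand-in for a batched map over a dict of columns, and the names `batch_fn` and `drop_last_batch` are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Columns = Dict[str, List[Any]]


def batch_fn(unbatched: Columns) -> Columns:
    """Wrap every column of a chunk in a list, so the whole chunk becomes one row."""
    return {name: [values] for name, values in unbatched.items()}


def map_batched(
    columns: Columns,
    fn: Callable[[Columns], Columns],
    batch_size: int,
    drop_last_batch: bool = False,
) -> Columns:
    """Hypothetical stand-in for a batched map (not the library's implementation).

    Applies `fn` to consecutive chunks of `batch_size` rows and concatenates
    the rows that `fn` returns.
    """
    num_rows = len(next(iter(columns.values()), []))
    out: Columns = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        chunk = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Optionally skip a trailing chunk that is smaller than batch_size.
        if drop_last_batch and len(next(iter(chunk.values()))) < batch_size:
            break
        for name, values in fn(chunk).items():
            out.setdefault(name, []).extend(values)
    return out


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
batched = map_batched(data, batch_fn, batch_size=2)
print(batched["id"])    # [[0, 1], [2, 3], [4]]
print(batched["text"])  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because each call to `batch_fn` returns exactly one row per chunk, the number of output rows equals the number of chunks, which is why a batched map is enough to express batching.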