From conversations on Slack with @clarkzinzow:

Clark Zinzow (Ray team):

Datasets always kicks off reading of the first block, so if you're giving `parallelism=1`, that means that it will read all provided paths into one block, which will always be eagerly read. We do this eager read so users can rapidly introspect the schema and other metadata, along with the first few rows of the data. If you were to provide `parallelism=len(paths)`, this should read at most one file.

Guillaume Desforges:
I don't get access to `len(paths)`, as those are computed inside of `FileBasedDatasource.prepare_read`:

https://github.com/ray-project/ray/blob/6560a2bf5df02ba7a975c249698f90bd850c2390/python/ray/data/datasource/file_based_datasource.py#L270

Users will pass a single path, which is the top-level data directory. So the actual `len(paths)` is not available at the time of calling `read_datasource`. I could ask users to expand the path themselves and set `parallelism` accordingly, but I believe Ray's UX would benefit from making one block per expanded file path instead of relying on `parallelism`.
Using a very high value for `parallelism` crashes unexpectedly: #24296
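For illustration, a minimal sketch of the workaround discussed above, assuming the files can be enumerated up front (the directory and file pattern are made up):

```python
import glob

import ray

# Expand the top-level data directory ourselves so that len(paths) is
# known before calling the read API (the workaround discussed above).
paths = glob.glob("/data/images/**/*.png", recursive=True)

# With parallelism=len(paths), each ReadTask should read at most one
# file, so the eager read of the first block only touches one file.
ds = ray.data.read_binary_files(paths, parallelism=len(paths))
```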
From my investigation, a datasource cannot generate data lazily.
A datasource creates a list of `ReadTask`s, which are then consumed by `ray.data.read_datasource`:
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/read_api.py#L227
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/read_api.py#L259
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/read_api.py#L263-L264
In `ReadTask.__call__`, all blocks are evaluated until the `read_fn` is exhausted:
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/datasource/datasource.py#L156-L159
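To make the behavior concrete, here is a minimal sketch in plain Python (not Ray's actual code) of why a generator-based `read_fn` still gets fully consumed:

```python
def read_fn():
    # Yields one block per file, like FileBasedDatasource's read function.
    for path in ["a.png", "b.png", "c.png"]:
        print(f"reading {path}")
        yield f"block for {path}"

def call_read_task(read_fn):
    # Mirrors the behavior described above: the task iterates the
    # generator to exhaustion, materializing every block up front.
    return [block for block in read_fn()]

blocks = call_read_task(read_fn)  # prints all three paths immediately
```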
So, when calling `dataset.iter_batches`, even if the passed `read_fn` is designed to `yield` results, like in `FileBasedDatasource`, Ray will compute all the blocks until the `read_fn` is done.
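Concretely, at the API surface (the path is illustrative):

```python
import ray

ds = ray.data.read_binary_files("/data/images")

# Even though FileBasedDatasource's read function yields blocks one file
# at a time, requesting the first batch forces the whole ReadTask that
# owns that block to run to completion first.
first_batch = next(iter(ds.iter_batches(batch_size=32)))
```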
Also, `read_datasource` relies on `LazyBlockList`:
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/read_api.py#L259
In order for laziness to be possible, one would need `LazyBlockList.iter_blocks_with_metadata` not to expect one task to produce exactly one block:
https://github.com/ray-project/ray/blob/026849cd272013003212159c1a12de163449b761/python/ray/data/impl/lazy_block_list.py#L403-L406
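A rough sketch of that assumption versus what task-level laziness would need (the helpers `execute` and `execute_streaming` are hypothetical, not Ray's actual code):

```python
def iter_blocks_with_metadata(tasks):
    # Current assumption: executing a task yields exactly one block, so
    # the loop can pair each task with a single block reference.
    for task in tasks:
        block = execute(task)  # hypothetical: runs the task to completion
        yield block, task.metadata

def iter_blocks_with_metadata_lazy(tasks):
    # What task-level laziness would need: surface each block as the
    # task produces it, instead of waiting for the task to finish.
    for task in tasks:
        for block, metadata in execute_streaming(task):  # hypothetical
            yield block, metadata
```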
So some core changes would be needed to achieve laziness of a `Datasource` at the task level.
@GuillaumeDesforges What you're looking for here is for each `ReadTask` to be able to yield one or more blocks, and for downstream operations to be able to start executing on those individual blocks; i.e., you want to pipeline execution within a single `ReadTask`'s blocks.
> in order for laziness to be possible, one would need `LazyBlockList.iter_blocks_with_metadata` not to expect 1 task to produce 1 block
This aspect is already supported via dynamic block splitting (disabled by default due to some remaining fault-tolerance and performance work), where a `ReadTask` is able to dynamically (at read task execution time) produce more than one block. See the code path here, where we treat the `ReadTask` output as a partition containing one or more blocks.
What's missing is allowing the read stage to yield control to downstream operations and pipeline execution on these blocks.
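For anyone wanting to try the existing dynamic block splitting, a sketch of how it was toggled at the time, if I recall correctly (the exact flag names may differ between Ray versions):

```python
from ray.data.context import DatasetContext

# Opt in to dynamic block splitting (off by default at the time, as
# noted above); a ReadTask can then emit multiple blocks, split at
# roughly target_max_block_size bytes.
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = True
ctx.target_max_block_size = 512 * 1024 * 1024  # 512 MiB
```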
Hi! In the Ray 2.4 changelog (https://github.com/ray-project/ray/releases/tag/ray-2.4.0) I can see:

> We've implemented asynchronous batch prefetching of Dataset.iter_batches (doc), improving performance by fetching data in parallel while the main thread continues processing, thus reducing waiting time.

Does it address this issue?
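For reference, the prefetching mentioned in the changelog looks roughly like this (`prefetch_batches` is the knob added in 2.4; the path and `train_step` are placeholders):

```python
import ray

def train_step(batch):
    # Placeholder for a real training step.
    pass

ds = ray.data.read_images("/data/images")

# Fetch upcoming batches in a background thread while the main thread
# is still processing the current one (the feature quoted above).
for batch in ds.iter_batches(batch_size=32, prefetch_batches=2):
    train_step(batch)
```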
Hey @anyscalesam, I'm excited to see this completed!
Out of curiosity, would you mind linking the commit that implements it? I'm curious to know how you've handled this :eyes:
Description
The file-based datasource should read one block at a time, not one block for a whole list of files.
My understanding is that there are only as many `ReadTask`s as `parallelism`:

https://github.com/ray-project/ray/blob/6560a2bf5df02ba7a975c249698f90bd850c2390/python/ray/data/datasource/file_based_datasource.py#L331-L333

and a `ReadTask` is run from start to end in one go.

Use case
I have many (>1000) large image files (>1 GB each), from each of which I can generate ~500 "patch" images (using openslide, for instance), which I use in a machine learning pipeline.
I want to start training my model while the data is loading, so the snippet should not load all the data before yielding a single batch.
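The original snippet isn't reproduced here, but a minimal sketch of the intended pipeline might look like this (the `PatchDatasource` class, its `_read_stream` hook, and the paths are hypothetical):

```python
import ray
from ray.data.datasource import FileBasedDatasource

class PatchDatasource(FileBasedDatasource):
    """Hypothetical datasource: opens one slide file with openslide and
    yields ~500 patch images from it."""

    def _read_stream(self, f, path, **reader_args):
        # Sketch only: real code would tile the slide with openslide
        # and yield one block of patches at a time.
        yield ...

ds = ray.data.read_datasource(PatchDatasource(), paths="/data/slides")

# Desired behavior: training starts as soon as the first block of
# patches is ready, instead of after all slides have been read.
for batch in ds.iter_batches(batch_size=32):
    ...  # train on the batch
```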