
pre_fetch option in addition to cache for lib.File #40


dmpetrov commented 7 months ago

We need to download items asynchronously before processing them:

chain.settings(pre_fetch=2, cache=True, parallel=True).gen(laion=process_webdataset(spec=WDSLaion))

OUTDATED:

ds.generate(WebDataset(spec=WDSLaion), parallel=4, cache=True, pre_fetch=10)

rlamy commented 7 months ago

Note that with the current architecture, pre_fetch won't do much, since only one File object exists at a time (assuming no batching).

dmpetrov commented 7 months ago

@rlamy we should change the architecture so that pre-fetching actually helps.

shcheklein commented 3 months ago

This depends on the file API refactoring (moving indexing to the app level). Moving back to the backlog for now.

shcheklein commented 1 month ago

Since we are more or less done with indexing, moving this back to the ready stage, cc @rlamy. It might still depend on some work Ronan is doing now on decoupling datasetquery and datachain.

One of the use cases I have atm is:


One thing that is a bit annoying is that some tools (e.g. OpenCV) seem to require a local path. Yes, cache helps in that case and pre-fetch can help, but both require downloading the whole file, while for some operations I just need the header. If someone has ideas how that can be improved, let me know. Is there a way to create a file-like object that is a stream from the cloud underneath?
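
For what it's worth, fsspec (which datachain builds on for storage access, as far as I know) provides exactly that kind of object: fsspec.open() returns a file-like handle backed by range requests, so reading just the header only fetches the blocks actually touched. A minimal sketch, assuming s3fs is installed and a hypothetical object path:

import fsspec

# Bytes are fetched lazily via range requests; reading 64 KiB of header
# does not download the rest of the file.
with fsspec.open("s3://bucket/key.jpg", "rb") as f:
    header = f.read(64 * 1024)

It doesn't help with tools like OpenCV that insist on a real local path, though.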

rlamy commented 1 month ago

Some notes:

This means that udf.run() should receive model instances, not raw DB rows, which requires some refactoring...

shcheklein commented 1 month ago

This means that udf.run() should receive model instances, not raw DB rows, which requires some refactoring...

where do we receive raw DB rows there? (I wonder if this is related or should be taken into account: https://github.com/iterative/studio/issues/10531#issuecomment-2379390308)

rlamy commented 1 month ago

After probably too much refactoring, I can confirm that this can be implemented inside udf.run().

Ignoring a lot of details, the basic idea is to change the implementation of udf.run() from this:

for db_row in udf_inputs:
    obj_row = self._prepare(db_row)  # deserialize the DB row into model instances
    obj_result = self.process(obj_row)  # run the user's UDF
    yield self._convert_result(obj_result)

to this:

obj_rows = (self._prepare(db_row) for db_row in udf_inputs)
obj_rows = AsyncMapper(_prefetch_row, obj_rows, workers=pre_fetch)  # download up to pre_fetch rows ahead
for obj_row in obj_rows:
    obj_result = self.process(obj_row)
    yield self._convert_result(obj_result)

where _prefetch_row looks like:

async def _prefetch_row(row):
    # Download every File in the row into the cache before processing.
    for obj in row:
        if isinstance(obj, File):
            await obj._prefetch()
    return row
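
For illustration, here is a self-contained sketch of the same overlap pattern using plain asyncio (prefetch_iter and fake_download are hypothetical names, not the actual AsyncMapper implementation): up to `workers` downloads stay in flight while results are consumed in order.

import asyncio
from collections import deque

async def prefetch_iter(func, items, workers=2):
    # Keep up to `workers` func(item) calls in flight ahead of the
    # consumer, yielding results in their original order.
    pending = deque()
    for item in items:
        pending.append(asyncio.create_task(func(item)))
        if len(pending) >= workers:
            yield await pending.popleft()
    while pending:
        yield await pending.popleft()

async def demo():
    async def fake_download(i):
        await asyncio.sleep(0.1)  # simulate network I/O
        return i

    async for result in prefetch_iter(fake_download, range(5)):
        print(result)

asyncio.run(demo())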

Note that _prefetch_row can easily be generalised to arbitrary models if we define some kind of "prefetching protocol".
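
A minimal sketch of what such a protocol could look like, using typing.Protocol (Prefetchable is a hypothetical name, not an existing datachain class):

from typing import Protocol, runtime_checkable

@runtime_checkable
class Prefetchable(Protocol):
    async def _prefetch(self) -> None: ...

async def _prefetch_row(row):
    # Prefetch any model that opts into the protocol, not just File.
    for obj in row:
        if isinstance(obj, Prefetchable):
            await obj._prefetch()
    return row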

dmpetrov commented 1 month ago

this can be implemented inside udf.run()

It looks like the right way of solving this. Thank you!

rlamy commented 4 weeks ago

The proposed implementation has a problem: it hangs when run in distributed mode, i.e. when using something like .settings(prefetch=2, workers=2). Here's what happens (with some simplifications!) when running a mapper UDF in that case:

Possible solutions

rlamy commented 3 weeks ago

Using threading in AsyncMapper.produce() runs into the issue that iteration needs to be thread-safe, but that seems fixable, see #521. That PR only deals with Mapper and Generator, though. Regarding the other two classes:
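
For context, the generic way to make shared iteration thread-safe is to guard __next__ with a lock; a minimal sketch (the actual fix in #521 may differ):

import threading

class ThreadSafeIterator:
    # Serialise access so multiple threads can pull from one iterator
    # without interleaving partially-completed __next__ calls.
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)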

shcheklein commented 3 weeks ago

    def get_inputs(self):
        while (batch := get_from_queue(self.task_queue)) != STOP_SIGNAL:
            yield batch

minor observation - batch can be renamed here - it's not really a batch, right?


Aggregator and BatchMapper should be related to each other, no? Both probably iterate over batches of rows and send them to the UDF?

I think prefetch still makes sense (we can start fetching the next batch?). It can definitely be a follow-up / separate ticket to discuss and prioritize.