iterative / datachain

AI-data warehouse to enrich, transform and analyze unstructured data
https://docs.datachain.ai
Apache License 2.0

`batch` and `batch_map` are broken #84

Closed (dberenbaum closed this issue 3 months ago)

dberenbaum commented 4 months ago

Description

batch and batch_map do not work and it's not clear what the syntax should be.

Take this example:

DataChain.from_storage(path="gs://dvcx-datalakes/dogs-and-cats/").settings(
    batch=10
).map(
    lambda file: len(file.parent + file.name),
    # lambda files: [len(file.parent + file.name) for file in files],
    params=["file"],
    output={"path_len": int},
).show()

It fails with error TypeError: DatasetQuery.add_signals() got an unexpected keyword argument 'batch'. Do we want to be passing the batch size to the non-batched map() method, or should we always assume batch size is 1 here? If batch > 1, should the udf expect batched inputs and outputs?

Do we need support for this and batch_map()? It's not clear what each one should do.

Note that batch_map() now looks like it's just a copy of gen() and fails with the same error.

dmpetrov commented 4 months ago

> Do we need support for this and batch_map()?

There is no urgent need for this since we support setup(). Let's exclude these from the public API for now.

PS: It is supposed to work like this:

from typing import Iterator

# A lambda cannot be a generator, so use a named function.
# With batch=10, `file` is intended to be a batch of files, not a single file.
def func(file) -> Iterator[int]:
    for f in file:
        yield len(f.parent + f.name)

(
    DataChain.from_storage(path="gs://dvcx-datalakes/dogs-and-cats/")
    .settings(batch=10)
    .map(path_len=func)
    .show()
)
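
For readers trying to pin down the semantics discussed in this thread, here is a minimal plain-Python sketch of what "batched inputs, one output per row" could look like. It does not use DataChain at all: `FakeFile` and `run_batched` are hypothetical names invented for illustration, and the chunking logic is an assumption about the intended behavior, not the library's implementation.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator, List


@dataclass
class FakeFile:
    """Hypothetical stand-in for a file record; not a DataChain class."""
    parent: str
    name: str


def path_len_batch(files: List[FakeFile]) -> Iterator[int]:
    # Batched UDF: receives a list of rows and yields one result per row.
    for f in files:
        yield len(f.parent + f.name)


def run_batched(
    rows: Iterable[FakeFile],
    udf: Callable[[List[FakeFile]], Iterable[int]],
    batch_size: int,
) -> List[int]:
    """Hypothetical driver: cut rows into batches and flatten the UDF output."""
    it = iter(rows)
    results: List[int] = []
    # islice pulls at most batch_size rows; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        results.extend(udf(batch))
    return results


files = [FakeFile("dogs-and-cats/", f"cat.{i}.jpg") for i in range(25)]
lengths = run_batched(files, path_len_batch, batch_size=10)
```

With 25 rows and a batch size of 10, the UDF is called three times (10, 10, and 5 rows), yet the caller still gets exactly one `path_len` value per input row, in the original order. That is the property a batched `map()` would need to preserve if it were supported.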