iterative / datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://docs.datachain.ai
Apache License 2.0

Limit the listing scope by the storage directory specified in the URI #180

Closed mnrozhkov closed 2 months ago

mnrozhkov commented 2 months ago

Description

When I run the code below for the first time, the listing step takes a long time. I suspect this is because DataChain lists every file in the bucket. It would make sense to limit the listing to the directory specified in the URI. Is my understanding correct?

This matters even more for CI/CD pipelines, where I need to run a test script and listing every file in the bucket is unnecessary.
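To make the request concrete, here is a toy sketch of the difference between listing a whole bucket and listing only what sits under a directory prefix. It uses an in-memory list of made-up object keys and a hypothetical helper `list_keys`; it is not DataChain's listing code and makes no claim about how DataChain works internally.

```python
# Toy model only: the "bucket" is a plain list of hypothetical object keys.
def list_keys(keys, prefix=""):
    """Return the keys that start with the given prefix (all keys if empty)."""
    return [key for key in keys if key.startswith(prefix)]


bucket = [
    "fashion-product-images/images/1.jpg",
    "fashion-product-images/images/2.jpg",
    "other-dataset/a.csv",
    "other-dataset/b.csv",
]

everything = list_keys(bucket)                         # whole-bucket listing
scoped = list_keys(bucket, "fashion-product-images/")  # directory-scoped listing

print(len(everything), len(scoped))
```

With a real bucket the gap between the two counts can be very large, which is why a scoped listing matters for short-lived CI jobs.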

Code to reproduce
(from examples/computer_vision/fashion_product_images/1-quick-start.ipynb)

from datachain import C, DataChain

# Create a DataChain
dc = (
    DataChain.from_storage(
        "gs://datachain-demo/fashion-product-images", type="image", anon=True
    )
    .filter(C("file.name").glob("*.jpg"))
    .save()
)

Output:

image

Version Info

No response

shcheklein commented 2 months ago

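@mnrozhkov the workaround is to put a trailing / at the end of the URI, e.g. gs://datachain-demo/fashion-product-images/ so that it is treated as a directory.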
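A minimal sketch of that workaround, assuming the only change needed is the trailing slash. The helper name `with_trailing_slash` is hypothetical and not part of DataChain; the `from_storage` call is left as a comment and reuses the arguments from the issue unchanged.

```python
def with_trailing_slash(uri: str) -> str:
    """Return the URI unchanged if it already ends with '/', else append one."""
    return uri if uri.endswith("/") else uri + "/"


uri = with_trailing_slash("gs://datachain-demo/fashion-product-images")
print(uri)  # gs://datachain-demo/fashion-product-images/

# Usage with the call from the issue (needs datachain installed and network access):
# DataChain.from_storage(uri, type="image", anon=True)
```

Normalizing the URI in one place keeps test scripts from accidentally dropping the slash.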

Improvement to the semantics is here: https://github.com/iterative/datachain/pull/108

shcheklein commented 2 months ago

Hey, @iterative/datachain please review https://github.com/iterative/datachain/pull/108 (cc @ilongin )

shcheklein commented 2 months ago

Should be fixed in the latest release.