iterative / datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://docs.datachain.ai
Apache License 2.0
702 stars 37 forks

Introduce @ operator for File path manipulations #299

Closed volkfox closed 3 days ago

volkfox commented 4 weeks ago

Description

There are a number of common operations on File paths, e.g. {stem, dir, basename, ext}, which are poorly supported.
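These operations map directly onto Python's pathlib, which makes the intended semantics concrete (illustration only, not DataChain API):

```python
from pathlib import PurePosixPath

p = PurePosixPath("data/images/cat.001.jpg")

print(p.stem)    # stem: "cat.001"
print(p.parent)  # dir: "data/images"
print(p.name)    # basename: "cat.001.jpg"
print(p.suffix)  # ext: ".jpg"
```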

For example, if we need to merge two datasets, "images" and "meta", by file stem, we currently need several manipulations to create and then remove the merge keys, which is not ideal:

from datachain.sql.functions import path

images = images.mutate(images_stem=path.file_stem("file.path"))
meta = meta.mutate(meta_stem=path.file_stem("file.path"))

annotated_images = images.merge(meta, on="images_stem", right_on="meta_stem")

annotated_images.select_except(["images_stem", "meta_stem"])

A preferable way would be to apply path-related functions on the fly:

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")

dmpetrov commented 4 weeks ago

That's an amazing idea! We definitely need to simplify the usage of vectorized operations.

annotated = images.merge(meta, on="file.path@stem", right_on="meta_file.path@stem")

There are two possible approaches:

  1. Apply general functions to columns: on="file.path@stem" (applies built-in stem() to columns)
  2. Introduce special methods in data-model classes: on="file@stem" (where stem() is an object method - a special method that does the translation). @shcheklein was sharing similar ideas in the latest nightly sync-up meeting.

(1) seems straightforward and we should start with it, while (2) has bigger potential - users could define custom vectorized operations - but it requires additional research.
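As a rough sketch of how (1) could work, the merge-key string could be split on `@` and the suffix looked up in a registry of built-in functions (all names here are hypothetical, with plain-Python stand-ins for the vectorized SQL expressions):

```python
import os

# Hypothetical registry of built-in path functions; plain-Python
# stand-ins for what would really be vectorized SQL expressions.
PATH_FUNCS = {
    "stem": lambda p: os.path.splitext(os.path.basename(p))[0],
    "ext": lambda p: os.path.splitext(p)[1].lstrip("."),
    "basename": os.path.basename,
    "dir": os.path.dirname,
}

def parse_merge_key(spec):
    """Split 'file.path@stem' into the column name and its transform."""
    if "@" not in spec:
        return spec, None
    column, func_name = spec.rsplit("@", 1)
    return column, PATH_FUNCS[func_name]

column, fn = parse_merge_key("file.path@stem")
print(column, fn("data/cat.001.jpg"))  # file.path cat.001
```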

mattseddon commented 2 weeks ago

Is there a reason some of the DataChain methods accept strings (e.g. merge) and others accept Columns (e.g. filter)?

If merge accepted Columns in the same way that filter does then the syntax could be as follows:

images.merge(meta, on=path.file_stem(C("file.path")), right_on=path.file_stem(C("meta_file.path")))

Using this approach we wouldn't have to map our custom SQL functions to text identifiers and/or search for @ symbols in the provided strings.

We could also write a wrapper function that takes a string and returns the required SQL function, i.e.:

def stem(column_name):
    return path.file_stem(C(column_name))

which would make the final merge syntax something like:

images.merge(meta, on=stem("file.path"), right_on=stem("meta_file.path"))

Using columns, though, would mean that users can join using any SQL function available through SQLAlchemy.
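To make that last point concrete: joining on column expressions is something the underlying database engine already supports, as this plain-SQL sketch shows (sqlite3 here stands in for whatever backend SQLAlchemy targets, and lower() stands in for file_stem()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE images (name TEXT);
    CREATE TABLE meta (name TEXT, label TEXT);
    INSERT INTO images VALUES ('CAT');
    INSERT INTO meta VALUES ('cat', 'feline');
""")

# Join on an expression of the columns - no precomputed merge key needed.
rows = con.execute("""
    SELECT images.name, meta.label
    FROM images JOIN meta ON lower(images.name) = lower(meta.name)
""").fetchall()
print(rows)  # [('CAT', 'feline')]
```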

mattseddon commented 2 weeks ago

or do we want to move all functions away from the C syntax?

dmpetrov commented 1 week ago

It's the best we can do as a first step. Later, we will figure out whether anything else is needed or this is enough.

mattseddon commented 1 week ago

I've still got a few things to work through but this is how I think the wds example will end up looking:

wds_images = (
    DataChain.from_storage(IMAGE_TARS)
    .settings(cache=True)
    .gen(laion=process_webdataset(spec=WDSLaion), params="file")
)

wds_with_pq = (
    DataChain.from_parquet(PARQUET_METADATA)
    .settings(cache=True)
    .merge(wds_images, on="uid", right_on="laion.json.uid", inner=True)
)

wds_npz = (
    DataChain.from_storage(NPZ_METADATA)
    .settings(cache=True)
    .gen(emd=process_laion_meta)
)

res = wds_npz.merge(
    wds_with_pq,
    on=[path.file_stem(wds_npz.c("emd.file.path")), "emd.index"],
    right_on=[path.file_stem(wds_with_pq.c("source.file.path")), "source.index"],
    inner=True,
).save("wds")

res.show(5)

Is this acceptable?

mattseddon commented 1 week ago

I've updated both examples/multimodal/clip_inference.py & examples/multimodal/wds.py in #388 to showcase the new syntax. This update also gives us a performance boost. PTAL.