iterative / datachain

DataChain 🔗 AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://datachain.dvc.ai
Apache License 2.0

Combine order_by() and mutate() #117

Closed dmpetrov closed 1 month ago

dmpetrov commented 1 month ago

Description

We need to make this single statement work: .order_by(dist=cosine_distance(Column("emd"), REMOVAL_TARGET)), instead of these two:

    .mutate(dist=cosine_distance(Column("emd"), REMOVAL_TARGET))
    .order_by("dist")

It seems straightforward to implement.

More detailed use case:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-mix-224")
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")

def process_image(file) -> str:
   image = file.get_value().convert("RGB")
   inputs = processor(text="caption", images=image, return_tensors="pt")
   generate_ids = model.generate(**inputs, max_new_tokens=100)
   return processor.batch_decode(generate_ids, skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)[0]

chain = (
      DataChain.from_storage("gs://datachain-demo/newyorker_caption_contest/images")
      .map(scene=process_image)
      .save("image_captions")
)

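import openai

def get_embedding(text: str) -> list[float]:
    # Legacy (pre-1.0) openai SDK call; the vector is nested inside the response.
    response = openai.Embedding.create(input=text, model="text-embedding-ada-002")
    return response["data"][0]["embedding"]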

REMOVAL_TARGET = get_embedding("green grass scene")
chain = DataChain.from_dataset("image_captions")

images_to_remove = (
    chain.map(emd=get_embedding, params=["scene"])  # pass the function itself, not its result
    .order_by(dist=cosine_distance(Column("emd"), REMOVAL_TARGET))
    .limit(15)
)

cleansed = chain.subtract(images_to_remove)
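
For readers who want to see the intended semantics without DataChain or an embedding API, here is a minimal plain-Python sketch of the same "order by distance, take the closest N, subtract them" flow. The file names, the 2-d vectors and the limit of 2 are made up for illustration, and `cosine_distance` here is a local helper that assumes the usual definition (1 minus cosine similarity), not the library function.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Local helper: 1 - cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy data: image name -> caption embedding.
embeddings = {
    "grass.jpg": [0.9, 0.1],
    "lawn.jpg": [0.8, 0.3],
    "office.jpg": [0.1, 0.9],
    "street.jpg": [0.2, 0.7],
}
removal_target = [1.0, 0.0]  # stand-in for the "green grass scene" embedding

# order_by(<distance expression>) + limit(2): the two images closest to the target.
to_remove = sorted(
    embeddings,
    key=lambda name: cosine_distance(embeddings[name], removal_target),
)[:2]

# subtract(): keep everything that was not selected for removal.
cleansed = [name for name in embeddings if name not in to_remove]

print(to_remove)
print(cleansed)
```

Note that the distance is only ever used as a sort key here; it never becomes part of the rows.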

dmpetrov commented 1 month ago

I'm setting it to P0 since it's required for the release.

mnrozhkov commented 1 month ago

@dmpetrov, it looks good and may help in some scenarios! At the same time, it doesn't look like a must-have TBH. Could you elaborate on the importance of this change? I have some concerns:

  1. Separating mutate and order_by improves readability and maintainability.
  2. This change may overload the order_by method's purpose.
  3. It might be premature optimization. Do we have user feedback suggesting this would significantly improve UX?

ilongin commented 1 month ago

I assume that dist in the example must be preserved in the chain, and to me it's strange that order_by creates new columns / signals. IMO creating them should be explicit, which is what mutate is for; order_by should have only one job: to order results.
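
To make the distinction concrete, here is a small plain-Python sketch (no DataChain involved) of the two behaviours being compared: computing a value and keeping it as a column before sorting, versus using the value only as a sort key. The rows, the `dist` helper and the target value are hypothetical.

```python
rows = [
    {"name": "b", "emd": 3.0},
    {"name": "a", "emd": 1.0},
    {"name": "c", "emd": 2.0},
]
target = 2.2


def dist(row: dict) -> float:
    """Hypothetical 1-d distance, standing in for cosine_distance."""
    return abs(row["emd"] - target)


# "mutate + order_by": the computed value becomes a visible column.
mutated = [{**row, "dist": dist(row)} for row in rows]
two_step = sorted(mutated, key=lambda row: row["dist"])

# "order_by(expression)": the value is only a sort key and is not kept.
one_step = sorted(rows, key=dist)

print([row["name"] for row in two_step], sorted(two_step[0]))
print([row["name"] for row in one_step], sorted(one_step[0]))
```

Both variants produce the same ordering; they differ only in whether `dist` is visible in the result.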

ilongin commented 1 month ago

If we just need to order by a function's result, that's already possible, but we need to fix the resolution of nested column names in function arguments (I created an issue for that: https://github.com/iterative/datachain/issues/125)

Example:

from datachain.sql.functions.string import length
from datachain.lib.dc import C, DataChain
from datachain.lib.file import File

names = ["aa.txt", "aaa.txt", "a.txt", "aaaaaa.txt", "aa.txt"]

assert (
    DataChain.from_values(file=[File(name=name) for name in names])
    .order_by(length(C("file__name")))  # need to fix this to accept file.name instead
    .collect_one("file.name")
) == ["a.txt", "aa.txt", "aa.txt", "aaa.txt", "aaaaaa.txt"]
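
As a sanity check on the expected list, the same ordering can be reproduced in plain Python by sorting on string length. This is only an analogy for what `order_by(length(...))` is expected to return, not a statement about how DataChain executes it.

```python
names = ["aa.txt", "aaa.txt", "a.txt", "aaaaaa.txt", "aa.txt"]

# Sort by name length only; the length itself is not stored anywhere.
ordered = sorted(names, key=len)

print(ordered)
```

The two "aa.txt" entries tie on length; since they are identical strings, their relative order does not affect the comparison.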

shcheklein commented 1 month ago

Changing to P1 since it's not about the release anymore.

shcheklein commented 1 month ago

per @dmpetrov - need a bit more input here.

dmpetrov commented 1 month ago

@ilongin: "I assume that dist in the example must be preserved in the chain"

No. The user should not see the signal after the command completes. The signal name is optional here. Here is a better code example: .order_by(cosine_distance(Column("emd"), REMOVAL_TARGET))

    DataChain.from_values(file=[File(name=name) for name in names])
    .order_by(length(C("file__name")))  # need to fix this to accept file.name instead

This looks good! Could you please check if this will work with .order_by(cosine_distance(Column("emd"), REMOVAL_TARGET))? If it works, that's exactly what we need.

ilongin commented 1 month ago

@dmpetrov: "This looks good! Could you please check if this will work with .order_by(cosine_distance(Column("emd"), REMOVAL_TARGET))? If it works, that's exactly what we need."

Yes, this works. Here is a simple example with nested column names as function arguments:

from pydantic import BaseModel
from datachain.sql.functions.array import cosine_distance
from datachain.lib.dc import C, DataChain
from datachain.lib.data_model import DataModel

target = [0.1, 0.2]

class Embedding(BaseModel):
    id: int
    values: list[float]

DataModel.register(Embedding)

assert list(
    DataChain.from_values(
        embedding=[
            Embedding(id=1, values=[0.1, 0.3]),
            Embedding(id=2, values=[0.1, 0.5]),
            Embedding(id=3, values=[0.1, 0.4]),
            Embedding(id=4, values=[0.1, 0.7]),
            Embedding(id=5, values=[0.1, 0.6]),
        ],
    )
    .order_by(cosine_distance(C("embedding.values"), target))
    .collect("embedding.id")
) == [1, 3, 2, 5, 4]
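
The expected id order can be double-checked without DataChain. The sketch below recomputes the distances in plain Python, assuming `cosine_distance` means 1 minus cosine similarity; the helper is a local stand-in, not the function imported above.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Local stand-in: 1 - cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


target = [0.1, 0.2]

# Same ids and vectors as in the example above.
embeddings = {
    1: [0.1, 0.3],
    2: [0.1, 0.5],
    3: [0.1, 0.4],
    4: [0.1, 0.7],
    5: [0.1, 0.6],
}

# Order ids by distance to the target, closest first.
ordered_ids = sorted(embeddings, key=lambda i: cosine_distance(embeddings[i], target))

print(ordered_ids)
```

The first component is fixed at 0.1, so a larger second component means a larger angle from the target, which is why the ids come out ordered by their second component.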

dmpetrov commented 1 month ago

Amazing! That's exactly what's needed 🙂

Closing.