iterative / datachain

DataChain 🔗 AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://datachain.dvc.ai
Apache License 2.0

Resolve nested column names in SQL functions #125

Closed ilongin closed 1 month ago

ilongin commented 1 month ago

We have a decorator that resolves nested column / signal names in DataChain method arguments (e.g. file.name -> file__name), but it doesn't work when the columns are arguments to a SQL function.
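To make the gap concrete, here is a minimal sketch of a decorator that only rewrites top-level string arguments. The names resolve_nested and Chain are hypothetical and the tuple merely stands in for a SQL function call; this is not the actual DataChain decorator, only an illustration of how a name nested inside another object could be missed.

```python
from functools import wraps


def resolve_nested(method):
    """Hypothetical decorator: rewrite dotted names in top-level string args."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        # Only plain strings are rewritten; anything else passes through untouched.
        resolved = [a.replace(".", "__") if isinstance(a, str) else a for a in args]
        return method(self, *resolved, **kwargs)

    return wrapper


class Chain:
    """Stand-in for a chain object (assumption, not the DataChain class)."""

    @resolve_nested
    def order_by(self, *columns):
        return columns


chain = Chain()
plain = chain.order_by("file.name")
# The tuple stands in for a function call wrapping a column name.
wrapped = chain.order_by(("length", "file.name"))
```

In this sketch the plain argument is rewritten to file__name, while the name inside the wrapping object is left as file.name.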

Example that doesn't work, but should if this is fixed:

from datachain.lib.dc import C, DataChain
from datachain.lib.file import File  # needed for File(name=...) below
from datachain.sql.functions.string import length

names = ["aa.txt", "aaa.txt", "a.txt", "aaaaaa.txt", "aa.txt"]
dc = DataChain.from_values(file=[File(name=name) for name in names])
# Order by file name length; this works if "file__name" is used instead of "file.name".
dc = dc.order_by(length(C("file.name")))

assert dc.collect_one("file.name") == ["a.txt", "aa.txt", "aa.txt", "aaa.txt", "aaaaaa.txt"]

rlamy commented 1 month ago

This will become even more important after #101, since accessing the file name will require something like name(C("file.path")).

ilongin commented 1 month ago

Looks like this already works: C from datachain.lib.dc resolves the name automatically by replacing . with __, so no additional changes are needed. I'm closing the issue.
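A rough sketch of why resolving the name inside the column object is enough: if the column normalizes its name when it is constructed, any function that wraps it sees the resolved name. Column and Length below are hypothetical stand-ins written for illustration, not the real C or length implementations.

```python
class Column:
    """Hypothetical column wrapper that resolves nested names on construction."""

    def __init__(self, name):
        # "file.name" becomes "file__name" before anything else sees it.
        self.name = name.replace(".", "__")


class Length:
    """Hypothetical SQL function wrapper around a single column."""

    def __init__(self, column):
        self.column = column

    def to_sql(self):
        # The function only ever reads the already-resolved column name.
        return f"LENGTH({self.column.name})"


expr = Length(Column("file.name"))
sql = expr.to_sql()
```

Because the replacement happens in the column itself, the wrapping function needs no special handling in this sketch.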