iterative / datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://docs.datachain.ai
Apache License 2.0
727 stars 39 forks source link

Decouple DataChain and DatasetQuery #41

Open dmpetrov opened 2 months ago

dmpetrov commented 2 months ago

It's needed for:

ToDo:

rlamy commented 2 months ago

A couple questions:

skshetry commented 2 months ago

Also, DatasetQuery and DataChain are no longer compatible with one another, and build different kinds of datasets. One could be used as a superset of the other, but with the introduction of signals_schema, that is no longer true.

The decisions here might affect https://github.com/iterative/dvcx/issues/1597 and similar other issues.

dmpetrov commented 2 months ago

DataChain should become a pipeline. Something like Parameterized Pipeline in DVC but in pythonic way. While DatasetQuery (name is still not good) is a lower level structure for running DB queries.

  • what's its expected relationship with DataChain

It's likely attribute. And datachain might potentially have multiple datasets / datasetqueries.

DatasetQuery and DataChain are no longer compatible with one another,

💯 That's one of the reasons why these needs to be decoupled.

ilongin commented 2 months ago

DatasetQuery is currently much more than just running DB queries so we would need to remove most of the logic there. My question is why do we even need 2 levels of API. It just complicates things and requires 2 levels of tests etc. If we need something to just run DB queries we have metastore / warehouse classes that are doing that atm.

Why don't we just merge those two i.e move needed stuff from DatasetQuery to Datachain and remove DatasetQuery altogether?

skshetry commented 2 months ago

DatasetQuery is currently much more than just running DB queries so we would need to remove most of the logic there.

DatasetQuery is an internal API, and is not going to be exposed to the users. So, as you have said, we should remove all the logic that does anything other than working with queries/udfs, etc.

Why don't we just merge those two i.e move needed stuff from DatasetQuery to Datachain and remove DatasetQuery altogether?

Dmitry wanted to have an internal abstraction over sqlalchemy that handles all the querying, udfs, chunking, etc. Querying can be done through sqlalchemy already as you have mentioned, but DatasetQuery needs to do more than that, which to me makes sense.


Tbh, I only learnt about it this week, that DatasetQuery is not a user-focused API. I have added some APIs like from_dataframe/from_pandas etc. that is no longer useful, and can be removed. Similarly, to_records() and other IO related APIs can be moved to DataChain or in some other places.

ilongin commented 2 months ago

@skshetry It was user-focused API but I would say it's just deprecated recently in favor of the new Datachain

rlamy commented 3 days ago

Since we're now getting rid of most of the code that expected DatasetQuery to be user-facing, it might be time to tackle this. A simple first step should be to just remove the inheritance relationship and add a _query: DatasetQuery attribute to DataChain (and replace all super calls, etc.) That should make it easier to see which parts should be moved from DatasetQuery and what its exact role (and name!) should be.