Open dmpetrov opened 2 months ago
A couple questions:
DatasetQuery
DataChain
(base class, attribute, something else...)?Also, DatasetQuery
and DataChain
are no longer compatible with one another, and build different kinds of datasets. One could be used as a superset of the other, but with the introduction of signals_schema
, that is no longer true.
The decisions here might affect https://github.com/iterative/dvcx/issues/1597 and similar other issues.
DataChain should become a pipeline. Something like Parameterized Pipeline in DVC but in pythonic way. While DatasetQuery (name is still not good) is a lower level structure for running DB queries.
- what's its expected relationship with
DataChain
It's likely attribute
. And datachain might potentially have multiple datasets / datasetqueries.
DatasetQuery
andDataChain
are no longer compatible with one another,
💯 That's one of the reasons why these needs to be decoupled.
DatasetQuery
is currently much more than just running DB queries so we would need to remove most of the logic there. My question is why do we even need 2 levels of API. It just complicates things and requires 2 levels of tests etc. If we need something to just run DB queries we have metastore / warehouse classes that are doing that atm.
Why don't we just merge those two i.e move needed stuff from DatasetQuery
to Datachain
and remove DatasetQuery
altogether?
DatasetQuery
is currently much more than just running DB queries so we would need to remove most of the logic there.
DatasetQuery
is an internal API, and is not going to be exposed to the users.
So, as you have said, we should remove all the logic that does anything other than working with queries/udfs, etc.
Why don't we just merge those two i.e move needed stuff from
DatasetQuery
toDatachain
and removeDatasetQuery
altogether?
Dmitry wanted to have an internal abstraction over sqlalchemy that handles all the querying, udfs, chunking, etc. Querying can be done through sqlalchemy already as you have mentioned, but DatasetQuery
needs to do more than that, which to me makes sense.
Tbh, I only learnt about it this week, that DatasetQuery
is not a user-focused API.
I have added some APIs like from_dataframe
/from_pandas
etc. that is no longer useful, and can be removed. Similarly, to_records()
and other IO related APIs can be moved to DataChain
or in some other places.
@skshetry It was user-focused API but I would say it's just deprecated recently in favor of the new Datachain
Since we're now getting rid of most of the code that expected DatasetQuery
to be user-facing, it might be time to tackle this. A simple first step should be to just remove the inheritance relationship and add a _query: DatasetQuery
attribute to DataChain
(and replace all super calls, etc.) That should make it easier to see which parts should be moved from DatasetQuery
and what its exact role (and name!) should be.
It's needed for:
ToDo: