@inigohidalgo thank you for putting a lot of thought into this - I spent last night reading about Ibis, which I think may be the modern and correct way to approach the problem space you're trying to work with. I'm still not sure how to make the syntax work neatly in the Kedro world, but I'm hoping it can work like this polars dataset we have nearly merged.
@inigohidalgo while this doesn't solve for your more detailed use case, have you looked at using globals?
This is how I currently handle the use case of updating my query for specific dates, and I believe they can be passed at runtime.
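For reference, here is a rough sketch of what the globals approach can look like (assuming Kedro 0.18.x, where TemplatedConfigLoader is available; the file name and the start_date key are made up for illustration):

```python
# settings.py -- hypothetical sketch of wiring up globals via TemplatedConfigLoader
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
# Pick up values from conf/<env>/globals.yml, e.g. `start_date: "2023-01-01"`,
# so catalog entries (such as a SQL query string) can reference ${start_date}.
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
```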
Thanks for the heads up about Ibis @datajoely, it does seem to be the direction we could go in, though it seems to go a step further than what we'd need (full computational backend vs just simple filtering, joining etc). I would note I haven't been able to connect to our Azure SQL databases using our usual connection method (pyodbc), so I'm not sure how mature it is, in case you're considering putting development towards it. I might open an issue in the project regarding this if it seems promising: basically at the moment they only provide the pymssql driver for connecting to mssql DBs, but it doesn't seem to be compatible with our authentication method (Azure AD).
Hey @bpmeek yeah that's the "compromise" solution we'd been using until now, basically just passing it as a general parameter. But it isn't really flexible enough for what we're trying to do a lot of the time.
For the moment I haven't made progress on this, as I've been sidetracked by other issues, but it is still very much an important question for my team and me. I am still considering pursuing my original idea, since it seems simple enough to implement through SQLAlchemy, which is much more mature and developed.
Hey all, I'm the lead developer of ibis!
We're happy to support alternative methods of connecting to mssql-oriented DBs. We haven't added that capability yet because no one has asked for it!
If that'll tip the scales in favor of ibis then I think we'd be happy to work on adding ODBC support to the MSSQL backend.
Thanks for chiming in @cpcloud! We haven't started exploring Ibis for real yet, but if the only limitation is mssql DBs, maybe we could build a proof of concept with PostgreSQL or something else. We'll surely reach out soon if we make progress on this front.
Went ahead and built a small PoC using DuckDB (link). I really like Ibis' API for achieving what I want; it's much simpler than SQLAlchemy. I will keep testing it, but it definitely seems like it could be the way forward for us.
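For context, a minimal sketch of the kind of lazy filtering the PoC does (not the PoC's actual code; the file, table, and column names are made up):

```python
import ibis

# Connect to a (hypothetical) local DuckDB file and reference a table lazily.
con = ibis.duckdb.connect("analytics.duckdb")
orders = con.table("orders")

# Build a filtered expression; no data is pulled yet.
recent = orders.filter(orders.order_date >= "2023-01-01")

# Only now does the query run, returning a pandas DataFrame.
df = recent.execute()
```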
I will open a separate issue in the appropriate repo when I investigate further, but: the ibis.Table object works flawlessly in the first node where I load it (filter_dates); if I call .execute on it, it returns the filtered dataframe as expected. But if I pass the filtered Table object into another node to then execute, like I do here in ibis_table_to_df, it throws AttributeError: 'Backend' object has no attribute 'con'.
Full traceback:
(Note: I tested it with both duckdb and sqlite as a backend and ran into the same issue.)
I haven't been able to dig too deep into what is causing it, but I suspect it could be related to how Kedro passes datasets from node to node (depending on the MemoryDataset copy_mode arg, deepcopy by default), as also seen in kedro-org/kedro#730. Maybe the backend is somehow losing its con object when being deepcopied?
Hi @inigohidalgo, sorry it took some time to get back, but thanks a lot for making this POC! This week I spoke to @datajoely and @marlenezw about Ibis & Kedro and I'm thrilled to see this is moving forward.
Two comments:
- Instead of .execute(), which is hardcoded to return a pandas.DataFrame, you can be more explicit and use .to_pandas, .to_pyarrow and so on, see https://github.com/ibis-project/ibis/issues/6351#issuecomment-1590882862.
- The con property is not part of __init__, but later created on the fly https://github.com/ibis-project/ibis/blob/6d4a34f74bf4880d33495ad94d44d58fdd97c2b0/ibis/backends/base/sql/alchemy/__init__.py#L122 cc @cpcloud. I'd try out setting the copy_mode arg to assign to see if that makes it pass.

Ideally there should be a repo that only contains the IbisDataSet that can be pip-installed, and then a showcase Kedro project that uses that (rather than putting the code in the extras subpackage). If you need help for the final stretch, ping me and I'll be happy to help!
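To illustrate the first point, a small sketch of the explicit output methods (the toy in-memory table is made up, and executing it assumes Ibis' default DuckDB backend is installed):

```python
import ibis

# Toy in-memory table, just to have something to execute against.
t = ibis.memtable({"order_date": ["2023-01-01", "2023-06-01"], "value": [1, 2]})
filtered = t.filter(t.order_date >= "2023-03-01")

df = filtered.to_pandas()      # explicit pandas output, instead of the implicit .execute()
arrow = filtered.to_pyarrow()  # or a pyarrow.Table, with no pandas in the middle
```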
I'd try out setting the copy_mode arg to assign to see if that makes it pass
Yeah that solves it, but it's kind of cumbersome to have to specify copy_mode for each Ibis memory dataset I'd be passing through, as at the moment I am not specifying any memory datasets at all in my catalog (note my comments in the Kedro Slack). I suggested more flexibility in the _infer_copy_mode function, as at the moment it only special-cases pandas and numpy and assigns everything else to deepcopy by default. That is only a tangential issue to IbisDataSet anyway, and I am still interested in implementing it regardless of needing to specify copy_mode in parameters.
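For reference, this is roughly what "specifying copy_mode" amounts to, registering the intermediate dataset explicitly (a sketch; the dataset name is made up, and the class is spelled MemoryDataSet in older Kedro versions):

```python
from kedro.io import DataCatalog, MemoryDataset

# Explicitly declare the intermediate Ibis table as a MemoryDataset with
# copy_mode="assign", so Kedro hands the same object to the next node
# instead of deepcopying it (which appears to drop the backend's `con`).
catalog = DataCatalog({"filtered_table": MemoryDataset(copy_mode="assign")})
```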
I like the idea of sharing the IbisDataSet as a standalone package, and I appreciate the comment regarding .to_pandas and .to_pyarrow; that particularity was highlighted by @gforsyth earlier today :) Although that is part of the node and not part of the dataset, which only returns an ibis.Table that I then operate on.
Do you think the IbisDataSet as-is would be in a state to publish? With the required tests and documentation obviously; all the necessary methods are implemented already, though potentially in a barebones manner.
Do you think the IbisDataSet as-is would be in a state to publish? With the required tests and documentation obviously; all the necessary methods are implemented already, though potentially in a barebones manner.
Release early, release often!
I extracted the IbisDataSet and filled out the missing functionality here: https://github.com/inigohidalgo/kedro-ibis-dataset
I also updated the PoC to use the external dataset package instead of the bundled extra: https://github.com/inigohidalgo/ibis-kedro-poc
I'd appreciate feedback on the implementation :)
This is super cool @inigohidalgo - until now SQL has always been a 2nd class citizen in Kedro and this is a huge step forward in terms of usability and robustness.
Longer term it would be cool to think about how to enable more comprehensive ibis support in Kedro. A first step may be to bring this into the kedro-datasets repository under something like ibis.SQLDataFrameDataSet?
I'd love to provide blanket support for all ibis datasets; the problem I'm not sure how to solve is how to manage optional dependencies twice (e.g. pip install ibis[pyspark] and kedro[spark.SparkDataSet]) from both user-workflow and project-maintenance-burden perspectives.
AFAIK @astrojuanlu is the Python packaging expert, so he might have more insight into what is possible using the new PEP 621 metadata, but bringing it into kedro-datasets would probably complicate that a bit more than keeping it as a separate package (or a subpackage within the kedro-extras repo), which would allow specifying dependency groups for kedro-ibis-dataset aligned with those Ibis defines.
As far as project maintenance goes, a GH action which parses Ibis' dependency groups and reproduces them as kedro-ibis-dataset groups shouldn't be too complicated to implement, though it might be kind of brittle if Ibis decides to move to a package manager other than Poetry. That said, I do know the tool I'm using (PDM) has a workflow to import Poetry dependencies into PEP 621-compatible deps.
Sidenote: is there a better place we could be having this discussion? I guess the package would solve this issue, but maybe visibility is limited buried inside an issue.
I'd love to provide blanket support for all ibis datasets, the problem I'm not sure how to solve is how to manage optional dependencies twice
Yeah I don't think there's a way to have two levels of optional dependencies. We either include those as part of the optional dependency groups we already have:
# kedro-datasets/setup.py
extras_require["spark.SparkDataSet"] = [SPARK, HDFS, S3FS, "ibis[pyspark]"]
or we provide another group:
extras_require["spark.SparkDataSet"] = [SPARK, HDFS, S3FS]
extras_require["spark.SparkDataSet_lazy"] = [SPARK, HDFS, S3FS, "ibis[pyspark]"]
Bringing it into kedro-datasets would probably complicate that a bit more than keeping it as a separate package
We don't have a well-defined process for how a dataset should "graduate" to be part of kedro-datasets, or whether we should control the maintenance burden in any way, so I'm improvising here. Since everybody can now do pip install kedro-ibis-dataset, I'd say let's promote this so users start playing with it, iron out the first issues that arise, and keep this thread open so that we can use those learnings to see how "lazy SQL filtering" can become a core part of Kedro.
Sidenote: is there a better place we could be having this discussion?
We can connect on #plugins-integrations on the Kedro Slack!
I'd love to provide blanket support for all ibis datasets, the problem I'm not sure how to solve is how to manage optional dependencies twice
Yeah I don't think there's a way to have two levels of optional dependencies.
I think the more solid path forward would be to say "all things SQL in Kedro go through Ibis" and make it a "first level" optional dependency. No need to add more levels of indirection I'd say. To confidently commit to that plan, I propose we put this initial dataset in the hands of users and play around with it a bit more.
Dropping this here for future reference https://delta.io/blog/2023-06-15-using-ibis-pyspark-delta-lake/
We did recently add to_delta and read_delta methods to Ibis for the DuckDB, Polars, and DataFusion backends. We have this issue tracking it for PySpark: https://github.com/ibis-project/ibis/issues/6383
let us know if prioritizing that would be helpful!
That's an amazing addition for interoperability. I don't use Spark so I can't comment on that, but the ability to read/write Delta without needing a cluster running is very valuable.
That's super cool @lostmygithubaccount
Now that kedro-ibis-dataset demonstrates that this is possible, is there something that we can do at the Framework level to improve this functionality?
I think the next steps are using the package, finding bugs and corner cases, and promoting it a bit more heavily. Maybe present a conference talk in 2024 :)
Other than that, any other loose ends? Or can we go ahead and close this issue?
https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis
Source code: https://github.com/deepyaman/jaffle-shop
So, technically this is possible. And now the question is what do we do about it:
- Do we bring it into kedro-datasets?
- Do we keep promoting it as a separate package?

A third option is not doing anything, but I'd love to take one of the two paths.
We've discussed at times that, since Ibis has lots of different backends, managing that from within kedro-datasets' existing layer of optional dependencies might be a mess. But @deepyaman, you have a more informed opinion.
cc @merelcht @noklam
So I have 3 cascading questions:
- Do we do a mega ibis.IbisDataSet that supports all back-ends, SQL/DataFrame whatever?
- If this isn't feasible would separate ibis.SQLDataset and ibis.DataFrameDatasets make sense?
- How do we handle Ibis' optional backend dependencies (e.g. pip install 'ibis-framework[oracle]') without someone having to update kedro-datasets every time it spots a new Ibis release?

Longer term I'd love to decrease the number of 1st party datasets we maintain on kedro-datasets and delegate that.
From my experience, at the moment ibis comes with quite a lot of dependencies even by default, so adding it to kedro-datasets imo would be difficult to manage. I think there might be something that can be done on the ibis side to lighten that a bit (@cpcloud), but I wouldn't really see the current dependencies as something desirable to be installed by default.
Keeping it as a separate package and managing that complexity on a more 1:1 basis between the IbisDataset package and Ibis makes more sense to me.
Agree with @inigohidalgo. Installing the union of all supported backends' dependencies by default would place quite a burden on end users.
We're in the middle of removing sqlalchemy as a direct dependency of Ibis, which reduces the number of dependencies (direct and transitive) by quite a bit, but even after that, with 20+ backends there are still a lot of dependencies if support for all backends were included in a single package.
We have to deal with roughly the same issue on our side, since most end users don't need 20 backends.
Could kedro-datasets take a similar approach to Ibis and use setuptools extras to allow users to install it like pip install kedro-datasets[snowflake] (for example)?
Yes @cpcloud we have something like this currently in our setup.py (sorry, pyproject.toml coming soon!) and the main point I was making is that we don't want to double up.
Could kedro-datasets take a similar approach to Ibis and use setuptools extras to allow users to install it like pip install kedro-datasets[snowflake] (for example)?
Yeah, they already do that AFAIK, but if it were included within kedro-datasets it would have to be as described in https://github.com/kedro-org/kedro/issues/2374#issuecomment-1598561951. That would be easier to manage if you just pip installed kedro-ibis, which directly had the different dependency groups aligned with Ibis'.
Ah, right. Thanks for the link, I didn't realize we're dealing with two levels of optionality here.
Hm, I'm not entirely sure what the best option is here. Perhaps including it as part of the existing extras might be okay?
I'd be in favour of having a separate kedro-ibis package, otherwise our choices will be constrained by the existing optional dependencies layer. It will be better for discoverability (see https://github.com/kedro-org/kedro-plugins/issues/401) and for tracking (no need to do anything fancy with telemetry, we have PyPI stats).
Keeping it as a separate package and managing that complexity on a more 1:1 basis between the IbisDataset package and Ibis makes more sense to me.
Essentially this.
Longer term I'd love to decrease the number of 1st party datasets we maintain on kedro-datasets and delegate that.
In progress: https://github.com/kedro-org/kedro-plugins/issues/535
- Do we do a mega ibis.IbisDataSet that supports all back-ends, SQL/DataFrame whatever?
- If this isn't feasible would separate ibis.SQLDataset and ibis.DataFrameDatasets make sense?
This is the key design question I'd say.
I will defer to @deepyaman for his opinion on different backends, as I have only really worked with the mssql and polars backends, but I'm not sure what the functional difference would be between the SQL and DataFrame datasets, as the API should (here again I defer to Deepyaman and Phillip) be the same. Instantiating the connection will have its own parameters, but I know the team tries to keep ibis.connect as a common entrypoint for all backends.
The main complexity is Ibis' different optional dependencies, and that would probably be the main maintenance burden within this hypothetical IbisDataset project, though it could probably be automated relatively easily, as we would only need to keep track of Ibis' dependency groups and make the same groups available within IbisDataset, each just pointing to the correct ibis-framework[group].
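To illustrate the "common entrypoint" point, a rough sketch (the connection strings and table names here are hypothetical):

```python
import ibis

# The same entry point works across backends; only the URL/parameters differ.
con_duck = ibis.connect("duckdb://analytics.duckdb")
con_pg = ibis.connect("postgres://user:password@localhost:5432/warehouse")

# After connecting, the table/expression API is backend-agnostic.
orders = con_duck.table("orders")
recent = orders.filter(orders.order_date >= "2023-01-01")
```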
Thanks @inigohidalgo.
It's largely correct that there are no differences in backend output for a given API, assuming that both backends implement the operation.
There are two major places where there is a difference in output when using to_pandas().
The first is when dealing with array&lt;T&gt; values: in some cases we're going through pyarrow, which will turn array&lt;T&gt; values into numpy arrays, and in other cases you'll get a list. I don't remember off the top of my head which backends return which type of value.
The second is NULL output. This one is kind of a mess right now because pandas has 3 options for representing NULL and we haven't been very rigorous in Ibis about enforcing a single one in to_pandas() results.
We're going to address both of these soon, likely in the next quarter.
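For context on that second point, the three common pandas null representations look like this (a quick illustration, not Ibis-specific code):

```python
import numpy as np
import pandas as pd

pd.Series([None, 1.5])                # None is coerced to np.nan in a float column
pd.Series([np.nan, 1.5])              # np.nan, the classic float NaN
pd.Series([pd.NA, 1], dtype="Int64")  # pd.NA, used by the nullable extension dtypes
```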
To solve the 2-level optional dependency conundrum, @deepyaman suggested pip install kedro-datasets[ibis-{{ backend }}] so we can keep the Ibis dataset inside kedro-datasets (helping discoverability).
To solve the 2-level optional dependency conundrum, @deepyaman suggested pip install kedro-datasets[ibis-{{ backend }}] so we can keep the Ibis dataset inside kedro-datasets (helping discoverability).
Turns out @inigohidalgo actually suggested the same thing 5 days ago: https://github.com/kedro-org/kedro/issues/2374#issuecomment-1921295984
Hah, right, credit where it's due @inigohidalgo!!
@deepyaman and I are in sync.
This implementation makes sense to me as long as the Kedro team is happy to take on the complexity of keeping track of Ibis' published dependency groups, as we would have to maintain 1:1 parity of kedro-datasets[ibis-{{backend}}] for every backend which Ibis publishes.
I might be wrong, but in principle shouldn't it be
[project.optional-dependencies]
...
ibis-pandas = ["ibis[pandas]"]
?
If we do this in a separate plugin, I wonder if we could automate the 1:1 parity with some sort of tooling which raises PRs in a dependabot-like fashion.
I might be wrong, but in principle shouldn't it be
[project.optional-dependencies] ... ibis-pandas = ["ibis[pandas]"]
?
It would be ibis-framework[pandas], but yes.
The only exception is for Flink right now, where you have to include apache-flink (the flink extra doesn't have anything for Ibis, because Flink has some outdated dependencies).
This implementation makes sense to me as long as the kedro team is happy to take on that complexity of keeping track of Ibis' published dependency groups, as we would have to maintain 1:1 parity of kedro-datasets[ibis-{{backend}}] for every backend which Ibis publishes
I was hoping to avoid this, because you can programmatically access the list of Ibis backends. As such, I'd written https://github.com/kedro-org/kedro-plugins/blob/09493963cf854cc852ef2377023e2aae13504f71/kedro-datasets/setup.py#L19-L27.
However, with the move to a static pyproject.toml in https://github.com/kedro-org/kedro-plugins/pull/570, this is no longer possible (to the best of my knowledge). We could specify extras_require as a dynamic attribute, but that would basically undo the work of the linked PR.
In the absence of an automated solution, I'm willing to maintain parity as Ibis publishes additional backends for now, as this is fairly infrequent. Furthermore, the extras are a convenience function, and will not block somebody from using a new Ibis backend with Kedro.
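For what it's worth, a hypothetical sketch of what such an automated solution could look like (this is not the code in the linked setup.py; it assumes Ibis keeps registering its backends under the "ibis.backends" entry point group, and Python 3.10+ for the entry_points keyword API):

```python
from importlib.metadata import entry_points

# Discover the backends declared by the installed ibis-framework distribution
# and derive one kedro-datasets extra per backend from them.
backends = sorted({ep.name for ep in entry_points(group="ibis.backends")})
extras_require = {f"ibis-{name}": [f"ibis-framework[{name}]"] for name in backends}
```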
If we were to have a separate kedro-ibis repo, we could have a dependabot-like CI script which makes PRs to pyproject.toml, keeping the Kedro and Ibis extras in sync?
I'm okay with closing this, as the use case is wonderfully solved by Ibis now.
Description
We are pulling from some upstream tables that have a long history, and when running our feature engineering, training etc pipelines we only want to grab a certain window of data (last year for example). This isn't a static query and can change with each successive run of the pipeline.
Currently, the SQL Datasets only load either a static query or an entire table. I would like to be able to dynamically filter these tables based on certain conditions.
Context
It is important because, if we are consuming upstream tables with 6 years of history but only care about the last year, it currently takes way too long to ingest data only to immediately discard more than 80% of it.
Possible Implementation
An idea I am considering, thanks to @datajoely's feedback, is to create a new dataset type which, when loaded, returns an object which when called will actually load the data, working the same way as PartitionedDataSet. This object can be manipulated within a node to add filters to it before finally calling pd.read_sql or the Spark equivalent. This is a possible implementation which is working for me. I haven't written up a true Kedro dataset as I saw a lot of extra code which would not actually help illustrate the functionality. The first block of code is the dataset definition, and the second block simulates how it would run in a Kedro workflow (without actually implementing any nodes or pipelines, just for illustration).
lazy_load_sql_query_dataset.py
test_dataset.py
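Since the attached files aren't reproduced here, below is a minimal sketch of the same pattern (class name, connection string, and helper methods are made up for illustration; this is not the content of the attachments):

```python
import pandas as pd
import sqlalchemy as sa


class LazySQLQuery:
    """Object returned by the dataset's load(); filters can be added before materialising."""

    def __init__(self, con_str: str, table: str):
        self._con_str = con_str
        self._table = table
        self._filters = []

    def add_filter(self, condition: str) -> "LazySQLQuery":
        # e.g. "order_date >= '2023-01-01'"
        self._filters.append(condition)
        return self

    def __call__(self) -> pd.DataFrame:
        where = f" WHERE {' AND '.join(self._filters)}" if self._filters else ""
        query = f"SELECT * FROM {self._table}{where}"  # illustration only; not injection-safe
        engine = sa.create_engine(self._con_str)
        with engine.connect() as conn:
            return pd.read_sql(sa.text(query), conn)


# Simulating the node side: the dataset loads lazily, the node narrows the window,
# and only the final call hits the database.
lazy = LazySQLQuery("sqlite:///analytics.db", "orders")
df = lazy.add_filter("order_date >= '2023-01-01'")()
```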
Possible Alternatives
Messing with the context and other ways of modifying Kedro behaviour through TemplatedConfigLoader and hooks, but it's just hacky and not scalable.