kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

[Feature Request] ✨SQLFluff plugin for linting SQL in data catalog. #754

Closed · galenseilis closed this 1 day ago

galenseilis commented 4 days ago

Description

It is nice to be able to use the OmegaConfigLoader (based on OmegaConf) to load templated SQL queries from the catalog. This allows for "dynamic SQL" instructions when parts of the query have parameters that are provided when the catalog is loaded.

However, this sort of templated SQL is not necessarily valid SQL until the template is resolved, which potentially invalidates the assumptions of tools that parse SQL.

One of the use cases I have for parsing SQL automatically is for linting. The only package I am currently aware of is SQLFluff.

Context

It would make it easier for me to run automated quality checks on the input SQL instructions while still being able to take advantage of OmegaConf. While it isn't always the highest priority, I think many projects can benefit from linting their SQL: it helps standardize formatting and can sometimes reveal other code quality issues.

Possible Implementation

I have not thought about the implementation in detail, but I suspect it would involve using the OmegaConfigLoader, looping through its entries to find those that contain SQL code, passing that SQL to SQLFluff, and capturing SQLFluff's output.
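A rough sketch of that loop, assuming (as the pandas SQL datasets do) that SQL-bearing entries keep the query under a top-level `sql` key; `sqlfluff.lint` is the package's simple Python API and needs `pip install sqlfluff`:

```python
def find_sql_entries(catalog: dict) -> dict:
    """Collect catalog entries that carry a raw SQL string.

    Assumes SQL-bearing datasets keep the query under a top-level
    ``sql`` key; adjust for other dataset types.
    """
    return {
        name: entry["sql"]
        for name, entry in catalog.items()
        if isinstance(entry, dict) and isinstance(entry.get("sql"), str)
    }


def lint_catalog(catalog: dict, dialect: str = "ansi") -> dict:
    """Lint every SQL-bearing entry with SQLFluff's simple API.

    Returns a mapping of entry name to SQLFluff's list of violation
    dicts (an empty list means the query is clean).
    """
    import sqlfluff  # third-party, imported lazily

    return {
        name: sqlfluff.lint(sql, dialect=dialect)
        for name, sql in find_sql_entries(catalog).items()
    }
```

This would have to run on the *resolved* catalog so SQLFluff sees valid SQL rather than raw templates.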

Possible Alternatives

An alternative is to have a plugin that exports valid SQL files from the catalog. If a catalog entry has a SQL string, then write that string to a file in some standardized choice of path. Then anyone who wants to do further analysis on valid SQL code has a snapshot of the current state of the SQL code. That could be helpful for sharing SQL code from Kedro projects anyway.

galenseilis commented 4 days ago

Mentioned here: https://kedro-org.slack.com/archives/C03RKP2LW64/p1720196860544229

deepyaman commented 4 days ago

In short, I think you can definitely build such a third-party plugin (out of personal interest/need), but I don't think it's likely to be something the team would want to maintain as part of the kedro-plugins repo. For one, the Kedro team is quite selective about the set of plugins they support, so that they can ensure good quality and maintenance across a focused set of projects. As a Python-focused framework, maintaining a SQL-specific plugin (that will not be trivial, because it needs to understand the interpolation) seems like high effort to solve a less-frequent use case.

On a separate note, I'm not sure how good it is to implement something that further encourages loading data via SQL into the Python pipeline, instead of doing the processing on the database. A lot of use of *SQL*Datasets can probably be replaced with ibis.TableDataset, providing a more performant and Pythonic experience. This also helps move logic back to the nodes; I would argue having SQL logic in the catalog is an anti-pattern, because the data processing logic should live in the nodes.

Disclaimer: I helped create the current ibis.TableDataset, so account for my bias.
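To illustrate the point about moving logic into nodes: with `ibis.TableDataset` the catalog only names a table, and the query lives in the node as a deferred Ibis expression. The catalog entry, table, and column names below are invented:

```python
# Hypothetical catalog entry (instead of a SQL string):
#   shuttles:
#     type: ibis.TableDataset
#     table_name: shuttles
#     connection: ...
#
# The node receives an ibis.Table and builds the query as an expression,
# which the backend compiles to SQL and executes on the database.
def filter_recent_shuttles(shuttles, start_date: str = "2024-01-01"):
    """A Kedro node: keep shuttles launched on or after ``start_date``."""
    return shuttles.filter(shuttles.launch_date >= start_date)
```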

galenseilis commented 4 days ago

> In short, I think you can definitely build such a third-party plugin (out of personal interest/need), but I don't think it's likely to be something the team would want to maintain as part of the kedro-plugins repo. For one, the Kedro team is quite selective about the set of plugins they support, so that they can ensure good quality and maintenance across a focused set of projects. As a Python-focused framework, maintaining a SQL-specific plugin (that will not be trivial, because it needs to understand the interpolation) seems like high effort to solve a less-frequent use case.
>
> On a separate note, I'm not sure how good it is to implement something that further encourages loading data via SQL into the Python pipeline, instead of doing the processing on the database. A lot of use of *SQL*Datasets can probably be replaced with ibis.TableDataset, providing a more performant and Pythonic experience. This also helps move logic back to the nodes; I would argue having SQL logic in the catalog is an anti-pattern, because the data processing logic should live in the nodes.
>
> Disclaimer: I helped create the current ibis.TableDataset, so account for my bias.

Offhand I don't know how much effort it would be, but your analysis of the choice of plugins to support makes sense on the face of it.

I don't know anything about ibis, but I see your blog post. I'll take a look at it when I have time.

datajoely commented 1 day ago

Yeah @galenseilis, I would second @deepyaman's point; I really believe Ibis is the right way to build SQL-based pipelines in Kedro.

Kedro is very much built around the ergonomics of DataFrames; if you are looking to write and pipeline SQL, there are tools better built for that.

galenseilis commented 1 day ago

@datajoely @deepyaman I am looking into Ibis. It looks quite promising, and based on the material I've seen I agree it is a better way.

Unfortunately, until the SQL-to-Ibis translator is more mature, I will probably have to keep using at least some existing SQL via SQLQueryDataSet and SQLQueryTable in order to stay productive.

https://github.com/ibis-project/ibis/issues/9529

But I am hopeful that the new stuff I write can use Ibis. I'll create a minimum working example integrated with Kedro and show it to my team to get their feedback.

Anyway, thanks for considering my request!