Open DavidRetana-TomTom opened 3 months ago
@DavidRetana-TomTom Did you mean that you expect there is a query
argument? I am not sure what's the feature request here.
Yes, exactly.
@DavidRetana-TomTom this is a great push - this dataset is quite old so this may be newer functionality. I think it's a good idea to add this to our implementation.
There are two steps at this point: change `kedro-datasets` so that `SparkJDBCDataset` supports a `query` param like you need, and raise a PR to make this work. Would you be interested in doing this? We'd be here to coach you through the process.
I took what is described here, and hopefully this can be a starting point or workaround; I only implemented the `load` method. The change is in noklam/sparkjdbcdataset-not-working-639.
The diff: https://github.com/kedro-org/kedro-plugins/compare/noklam/sparkjdbcdataset-not-working-639?expand=1
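For anyone following along, the core of such a workaround is making `table` and `query` mutually exclusive and routing whichever was given into the JDBC reader options. A minimal, hypothetical sketch — the helper name `build_jdbc_load_args` is mine, not from the branch above:

```python
from typing import Any, Optional


def build_jdbc_load_args(
    url: str,
    table: Optional[str] = None,
    query: Optional[str] = None,
    properties: Optional[dict] = None,
) -> dict:
    """Assemble options for Spark's JDBC reader from a table OR a query.

    Spark's JDBC source rejects requests that set both ``dbtable`` and
    ``query``, so exactly one of ``table`` or ``query`` must be provided.
    """
    if not url:
        raise ValueError("'url' argument cannot be empty.")
    if bool(table) == bool(query):
        raise ValueError("Provide exactly one of 'table' or 'query'.")

    options: dict = {"url": url, **(properties or {})}
    if table:
        options["dbtable"] = table
    else:
        options["query"] = query
    return options


# With a live SparkSession, the assembled options would be consumed as:
#   spark.read.format("jdbc").options(**options).load()
```

Keeping the validation in a pure helper like this makes the table/query exclusivity testable without a running Spark cluster.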
```python
"""SparkJDBCDataset to load and save a PySpark DataFrame via JDBC."""
from copy import deepcopy
from typing import Any

from kedro.io.core import AbstractDataset, DatasetError
from pyspark.sql import DataFrame

from kedro_datasets.spark.spark_dataset import _get_spark


class SparkJDBCDataset(AbstractDataset[DataFrame, DataFrame]):
    """``SparkJDBCDataset`` loads data from a database table accessible
    via JDBC URL and connection properties and saves the content of
    a PySpark DataFrame to an external database table via JDBC. It uses
    ``pyspark.sql.DataFrameReader`` and ``pyspark.sql.DataFrameWriter``
    internally, so it supports all allowed PySpark options on ``jdbc``.

    Example usage for the `YAML API`:
    """
```
That should be enough for my use case. I can't open a pull request because I am not a collaborator of this project.
@DavidRetana-TomTom you can open one via the forking workflow! We'd really appreciate it if you have a chance.
Description
When using SparkJDBCDataset you need to specify the table name as a mandatory parameter. However, using the Spark JDBC connector directly, you can specify a query to retrieve data from the database instead of hardcoding a single table. Check out this link. According to the official Spark documentation:
The specified query will be parenthesized and used as a subquery in the FROM clause. Below are a couple of restrictions while using this option.
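To make the quoted behaviour concrete, Spark effectively wraps the user-supplied query in parentheses and selects from it as a subquery. A rough illustration (the alias name here is mine; Spark generates its own internal alias):

```python
def as_spark_subquery(query: str, alias: str = "subq") -> str:
    """Mimic how Spark's JDBC source embeds a user query in the FROM clause.

    Spark parenthesizes the query and treats it as a subquery; the ``alias``
    used here is illustrative only.
    """
    return f"SELECT * FROM ({query}) {alias}"
```

This is also why the documented restrictions exist: the query must be valid as a parenthesized subquery, so, for example, it cannot end with a semicolon.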
Context
This is especially important if you want to read data from multiple tables in the database, or if you want to run complex or spatial queries in the database instead of retrieving all the data and performing the computations in the cluster.
Steps to Reproduce
Source code right now (https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/spark/spark_jdbc_dataset.py):
Expected Result
I would like to have something like the following:
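A hedged sketch of what a catalog entry could look like — the `query` key is the proposed addition, and all other names and values are illustrative, following the existing `SparkJDBCDataset` YAML usage:

```yaml
weather_max_temp:
  type: spark.SparkJDBCDataset
  url: jdbc:postgresql://localhost/test
  # Proposed: accept a query instead of a hardcoded table name
  query: >-
    SELECT city, MAX(temperature) AS max_temp
    FROM weather
    GROUP BY city
  credentials: db_credentials
  load_args:
    properties:
      driver: org.postgresql.Driver
```

The dataset would then pass the query through to Spark's JDBC `query` option rather than `dbtable`.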
Your Environment
Include as many relevant details about the environment in which you experienced the bug:

* Kedro version used (`pip show kedro` or `kedro -V`): 0.19.3
* Kedro plugin and kedro plugin version used (`pip show kedro-airflow`):
* Python version used (`python -V`): 3.10