
Documentation for SparkHooks needs updating to v0.19 #4118

alexisdrakopoulos commented 2 weeks ago

Description

I think the stable docs here are out of date: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html

Specifically: parameters = context.config_loader.get("spark*", "spark*/**")

needs to be updated to the new dict-style access (i.e. config_loader["spark"]).
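
I would expect the updated snippet to look roughly like this (just a sketch based on the dict-style access; the hook structure mirrors the rest of that docs page):

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from the config in conf/<env>/spark.yml."""
        # Dict-style access replaces the old context.config_loader.get(...) call
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the SparkSession with the loaded configuration
        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")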

I am mentioning this because I tried config_loader["spark"] with:

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
    },
}

but it couldn't find conf/base/spark.yml for some reason, so I moved it to conf/databricks/spark.yml and now it finds it.
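
For reference, this lives in my project's settings.py and looks roughly like the following (assuming OmegaConfigLoader, which I believe is the default in 0.19, so the explicit loader class line may be redundant):

from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # The "spark" key is what makes config_loader["spark"] resolve these patterns
        "spark": ["spark*", "spark*/**"],
    },
}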

Documentation page (if applicable)

https://docs.kedro.org/en/stable/integrations/pyspark_integration.html

ElenaKhaustova commented 2 weeks ago

Thank you, @alexisdrakopoulos, for reporting the issue!

I tried to reproduce it: I created conf/base/spark.yml and set

CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "config_patterns": {
        "spark": ["spark*/"],
    },
}

and it seems to be working well; at least it can find conf/base/spark.yml.

[Screenshot 2024-08-27 at 12:05]
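
A quick way to check the resolution outside a pipeline run is something like this (a rough sketch; run from the project root, assuming a standard 0.19 project layout):

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Point Kedro at the current project and open a session
bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    context = session.load_context()
    # Should print the contents of conf/base/spark.yml
    print(context.config_loader["spark"])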

So, for me, it looks like this line in the docs might no longer be relevant: parameters = context.config_loader.get("spark*", "spark*/**")

We will double-check and get back to you.