kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.48k stars, 874 forks

PySpark is not being included in requirements.txt file in a new kedro project #3848

Closed thorugo-code closed 3 weeks ago

thorugo-code commented 2 months ago

Description

After starting a new kedro project with all the packages selected, I went into my project folder to install the requirements and PySpark isn't being installed because it's not included in the list of packages.

Context

The lack of PySpark is preventing the application from running.

Steps to Reproduce

  1. python -m venv venv
  2. ./venv/Scripts/activate.ps1
  3. pip install kedro
  4. kedro new
  5. select all packages and answer yes to pipeline example
  6. cd app
  7. pip install -r requirements.txt
  8. kedro run

Expected Result

kedro run starts and executes the example pipeline without errors.

Actual Result


╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in _run_module_as_main:198                                                                       │
│ in _run_code:88                                                                                  │
│                                                                                                  │
│ in <module>:7                                                                                    │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\cli.py:233 in main               │
│                                                                                                  │
│   230 │   cli_collection = KedroCLI(                                                             │
│   231 │   │   project_path=_find_kedro_project(Path.cwd()) or Path.cwd()                         │
│   232 │   )                                                                                      │
│ ❱ 233 │   cli_collection()                                                                       │
│   234                                                                                            │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1157 in __call__                       │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\cli.py:130 in main               │
│                                                                                                  │
│   127 │   │   )                                                                                  │
│   128 │   │                                                                                      │
│   129 │   │   try:                                                                               │
│ ❱ 130 │   │   │   super().main(                                                                  │
│   131 │   │   │   │   args=args,                                                                 │
│   132 │   │   │   │   prog_name=prog_name,                                                       │
│   133 │   │   │   │   complete_var=complete_var,                                                 │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1078 in main                           │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1688 in invoke                         │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:1434 in invoke                         │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\click\core.py:783 in invoke                          │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\cli\project.py:222 in run            │
│                                                                                                  │
│   219 │   tuple_tags = tuple(tags)                                                               │
│   220 │   tuple_node_names = tuple(node_names)                                                   │
│   221 │                                                                                          │
│ ❱ 222 │   with KedroSession.create(                                                              │
│   223 │   │   env=env, conf_source=conf_source, extra_params=params                              │
│   224 │   ) as session:                                                                          │
│   225 │   │   session.run(                                                                       │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\session\session.py:151 in create     │
│                                                                                                  │
│   148 │   │   Returns:                                                                           │
│   149 │   │   │   A new ``KedroSession`` instance.                                               │
│   150 │   │   """                                                                                │
│ ❱ 151 │   │   validate_settings()                                                                │
│   152 │   │                                                                                      │
│   153 │   │   session = cls(                                                                     │
│   154 │   │   │   project_path=project_path,                                                     │
│                                                                                                  │
│ F:\Testes\kedro-test\venv\Lib\site-packages\kedro\framework\project\__init__.py:293 in           │
│ validate_settings                                                                                │
│                                                                                                  │
│   290 │   │   )                                                                                  │
│   291 │   # Check if file exists, if it does, validate it.                                       │
│   292 │   if importlib.util.find_spec(f"{PACKAGE_NAME}.settings") is not None:                   │
│ ❱ 293 │   │   importlib.import_module(f"{PACKAGE_NAME}.settings")                                │
│   294 │   else:                                                                                  │
│   295 │   │   logger = logging.getLogger(__name__)                                               │
│   296 │   │   logger.warning("No 'settings.py' found, defaults will be used.")                   │
│                                                                                                  │
│ C:\Users\vitor\AppData\Local\Programs\Python\Python311\Lib\importlib\__init__.py:126 in          │
│ import_module                                                                                    │
│                                                                                                  │
│   123 │   │   │   if character != '.':                                                           │
│   124 │   │   │   │   break                                                                      │
│   125 │   │   │   level += 1                                                                     │
│ ❱ 126 │   return _bootstrap._gcd_import(name[level:], package, level)                            │
│   127                                                                                            │
│   128                                                                                            │
│   129 _RELOADING = {}                                                                            │
│ in _gcd_import:1204                                                                              │
│ in _find_and_load:1176                                                                           │
│ in _find_and_load_unlocked:1147                                                                  │
│ in _load_unlocked:690                                                                            │
│ in exec_module:940                                                                               │
│ in _call_with_frames_removed:241                                                                 │
│                                                                                                  │
│ F:\Testes\kedro-test\api\src\api\settings.py:6 in <module>                                       │
│                                                                                                  │
│    3 https://docs.kedro.org/en/stable/kedro_project_setup/settings.html."""                      │
│    4                                                                                             │
│    5 # Instantiated project hooks.                                                               │
│ ❱  6 from api.hooks import SparkHooks  # noqa: E402                                              │
│    7                                                                                             │
│    8 # Hooks are executed in a Last-In-First-Out (LIFO) order.                                   │
│    9 HOOKS = (SparkHooks(),)                                                                     │
│                                                                                                  │
│ F:\Testes\kedro-test\api\src\api\hooks.py:2 in <module>                                          │
│                                                                                                  │
│    1 from kedro.framework.hooks import hook_impl                                                 │
│ ❱  2 from pyspark import SparkConf                                                               │
│    3 from pyspark.sql import SparkSession                                                        │
│    4                                                                                             │
│    5                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'pyspark'
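
The traceback ends in a plain import failure, so one quick check is whether pyspark is visible to the interpreter in the active virtual environment at all. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical and not part of Kedro.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in the current environment.

    Hypothetical helper for illustration; it only looks the module up and
    does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run this with the same interpreter that runs `kedro run`.
    for name in ("kedro", "pyspark"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

If `pyspark` is reported as missing after `pip install -r requirements.txt`, the next thing to inspect is the content of requirements.txt itself.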

Your Environment

merelcht commented 1 month ago

Hi @thorugo-code, thanks for opening this issue. I'm sorry you're facing problems getting started with Kedro. It is actually not expected that pyspark is added to the requirements.txt. Instead, we'd expect:

kedro-datasets[spark-sparkdataset]>=3.0; python_version >= "3.9"
kedro-datasets[spark.SparkDataset]>=1.0; python_version < "3.9"

to be added. Our SparkDataset has a dependency on pyspark, so pyspark is installed as a transitive dependency. I've replicated the steps, and on my side pyspark is installed successfully without any alterations. Could you share your resulting requirements.txt file?
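
To check a generated requirements.txt for the entries quoted above, something like the following could be used. It is a rough sketch that relies on a simple regular expression rather than a full requirements parser, and the function name `spark_extras` is hypothetical. It only reports whether a `kedro-datasets` line carries an extra whose name mentions Spark.

```python
import re
from pathlib import Path

# Matches the start of lines such as:
#   kedro-datasets[spark-sparkdataset]>=3.0; python_version >= "3.9"
# and captures the comma-separated extras between the square brackets.
_KEDRO_DATASETS_EXTRAS = re.compile(
    r"^\s*kedro[-_]datasets\s*\[([^\]]*)\]", re.IGNORECASE
)


def spark_extras(requirement_lines):
    """Return the kedro-datasets extras whose name mentions 'spark'.

    Hypothetical helper for illustration; assumes one requirement per line.
    """
    found = []
    for line in requirement_lines:
        match = _KEDRO_DATASETS_EXTRAS.match(line)
        if match is None:
            continue
        for extra in match.group(1).split(","):
            extra = extra.strip()
            if "spark" in extra.lower():
                found.append(extra)
    return found


if __name__ == "__main__":
    path = Path("requirements.txt")  # assumed to be run from the project root
    if path.exists():
        extras = spark_extras(path.read_text(encoding="utf-8").splitlines())
        print("Spark extras found:", extras or "none")
    else:
        print("requirements.txt not found in the current directory")
```

An empty result would suggest the Spark dataset extra was not written into the file, which would explain why pyspark was never installed.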

merelcht commented 3 weeks ago

Closing this due to inactivity. Feel free to re-open this issue if you're facing the same problem.