kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

UnboundLocalError: cannot access local variable 'pipelines_package' where it is not associated with a value #3847

Open JenspederM opened 2 months ago

JenspederM commented 2 months ago

Description

An UnboundLocalError is raised when trying to print the result of find_pipelines() from the kedro.framework.project module.

Context

Unable to use find_pipelines

Steps to Reproduce

  1. Add print(find_pipelines()) to the bottom of the pipeline_registry.py file
  2. Run the file: python ./src/<project>/pipeline_registry.py

Expected Result

A dict of pipelines.

Actual Result

I get the following error:

[05/02/24 18:05:49] WARNING  warnings.py:110
/Users/.../.venv/lib/python3.12/site-packages/kedro/framework/project/__init__.py:350: UserWarning: An error occurred while importing the 'None.pipeline' module. Nothing defined therein will be returned by 'find_pipelines'.

Traceback (most recent call last):
  File "/Users/.../.venv/lib/python3.12/site-packages/kedro/framework/project/__init__.py", line 347, in find_pipelines
    pipeline_module = importlib.import_module(pipeline_module_name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/.../.rye/py/cpython@3.12.2/install/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'None'

  warnings.warn(

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/.../project/src/project/pipeline_registy.py:21 in <module>                                                                             │
│                                                                                                  │
│   18                                                                                             │
│   19                                                                                             │
│   20 if __name__ == "__main__":                                                                  │
│ ❱ 21 │   print(register_pipelines())                                                             │
│   22                                                                                             │
│                                                                                                  │
│ /Users/.../project/src/project/pipeline_registry.py:15 in register_pipelines                                                                   │
│                                                                                                  │
│   12 │   Returns:                                                                                │
│   13 │   │   A mapping from pipeline names to ``Pipeline`` objects.                              │
│   14 │   """                                                                                     │
│ ❱ 15 │   pipelines = find_pipelines()                                                            │
│   16 │   pipelines["__default__"] = sum(pipelines.values())                                      │
│   17 │   return pipelines                                                                        │
│   18                                                                                             │
│                                                                                                  │
│ /Users/.../.venv/lib/python3.12/site-packages/kedro/framework/project/__init__.py:367 in find_pipelines                                                        │
│                                                                                                  │
│   364 │   │   if str(exc) == f"No module named '{PACKAGE_NAME}.pipelines'":                      │
│   365 │   │   │   return pipelines_dict                                                          │
│   366 │                                                                                          │
│ ❱ 367 │   for pipeline_dir in pipelines_package.iterdir():                                       │
│   368 │   │   if not pipeline_dir.is_dir():                                                      │
│   369 │   │   │   continue                                                                       │
│   370                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: cannot access local variable 'pipelines_package' where it is not associated with a value

Your Environment

merelcht commented 1 month ago

Hi @JenspederM, thanks for flagging this issue. Can I ask what your use case is for printing the result of find_pipelines()?

This function was added to enable auto-discovery of pipelines, and it does some work behind the scenes to make sure your project and its modules are discoverable (https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html). It's meant to run as part of a "regular" Kedro flow, where it's preceded by certain project setup methods. You can fix your script by calling bootstrap_project() before find_pipelines() (https://docs.kedro.org/en/stable/kedro_project_setup/session.html#bootstrap-project-and-configure-project). However, I would only recommend doing that for exploration, not if you're planning to run that code in production.

Let me know if this makes sense!
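
A minimal sketch of the suggestion above, assuming Kedro is installed and that the path passed in is the root of a Kedro project. The helper name discover_pipelines is hypothetical; bootstrap_project and find_pipelines are the functions named in the comment.

```python
from pathlib import Path


def discover_pipelines(project_root: Path) -> dict:
    """Return the pipelines Kedro discovers for the project at project_root.

    Hypothetical helper for exploration only, as the comment above advises.
    """
    # Imported lazily so this module can still be imported where Kedro is absent.
    from kedro.framework.project import find_pipelines
    from kedro.framework.startup import bootstrap_project

    # bootstrap_project() reads the project metadata and configures the
    # project package, which is the setup find_pipelines() expects.
    bootstrap_project(project_root)
    return find_pipelines()


# Example usage (run from the project root):
#     print(discover_pipelines(Path.cwd()))
```

Running the setup step first means find_pipelines() sees a real package name instead of None, which is what the 'None.pipeline' warning in the report points to.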

JenspederM commented 1 month ago

Hi @merelcht,

Thank you for your reply.

I am using find_pipelines() to generate Databricks Asset Bundle resources. I am working on a template for asset bundles that uses Kedro for defining pipelines and dependencies, and Databricks Workflows for scheduling. You can find the project here

Thanks for suggesting bootstrap_project(). For now, I have been using configure_project(<package-name>), as used in databricks_run.py in the databricks-iris starter.

You can see my exact usage right here

JenspederM commented 1 month ago

@merelcht

I have been thinking of making a cookiecutter for Kedro as well. Do you think there would be any interest in this?

I made the template based on my own experience of running large scale Databricks projects in production with many contributors of varying levels of experience.

astrojuanlu commented 1 month ago

I'd say that, regardless of use case, internal code should not raise an UnboundLocalError; it should raise a more informative error instead.

I have been thinking of making a cookiecutter for Kedro as well. Do you think there would be any interest in this?

Of course! When you get to do it, we can promote it on https://github.com/kedro-org/awesome-kedro

Also consider exploring https://github.com/copier-org/copier/, a modern alternative to cookiecutter

JenspederM commented 1 month ago

The only problem that I haven't really found a solution for is how I would get the workspace host from the users' Databricks config without using the Databricks CLI.
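
One possible approach, sketched under the assumption that the user has a Databricks configuration profile file in INI format (by default ~/.databrickscfg) with a host key per profile. The function name read_workspace_host and the example host used below are hypothetical, and this only reads the file; it does not cover other authentication setups.

```python
import configparser
from pathlib import Path
from typing import Optional


def read_workspace_host(config_path: Path, profile: str = "DEFAULT") -> Optional[str]:
    """Return the 'host' value for a profile, or None if it cannot be found."""
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(config_path)
    if profile not in parser:
        return None
    # Note: configparser lets named profiles fall back to [DEFAULT] values.
    return parser[profile].get("host")


# Example usage, assuming the default file location:
#     host = read_workspace_host(Path.home() / ".databrickscfg")
```

Returning None instead of raising leaves it to the caller to decide whether a missing host should be an error or should trigger another lookup.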

JenspederM commented 1 month ago

I'd say that, regardless of use case, internal code should not raise an UnboundLocalError; it should raise a more informative error instead.

@astrojuanlu I also looked into the UnboundLocalError, and I see that it could be resolved by adding asserts or running validate_settings() in find_pipelines() and ParallelRunner._run().

Or does it deserve a greater redesign?

IMO global variables can be quite dangerous when used like this, so I would probably advise redesigning this logic to remove the use of globals.
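
The failure mode can be reproduced without Kedro. The sketch below is a simplified stand-in, not Kedro's actual code: it mirrors only the control flow visible in the traceback (an import in a try block, an except branch that returns only when the message matches, then use of the variable). The function names and the RuntimeError guard are hypothetical illustrations of the "fail early" idea.

```python
import importlib
from typing import Optional

# Stand-in for a module-level global that is only set once the project
# has been configured; None means "not configured yet".
PACKAGE_NAME: Optional[str] = None


def find_pipelines_unguarded() -> dict:
    """Simplified imitation of the control flow shown in the traceback."""
    pipelines_dict: dict = {}
    try:
        pipelines_package = importlib.import_module(f"{PACKAGE_NAME}.pipelines")
    except ModuleNotFoundError as exc:
        # With PACKAGE_NAME = None the message is "No module named 'None'",
        # so this comparison is False and execution falls through.
        if str(exc) == f"No module named '{PACKAGE_NAME}.pipelines'":
            return pipelines_dict
    # If the import failed, pipelines_package was never bound.
    return {"package": pipelines_package}


def find_pipelines_guarded() -> dict:
    """Same logic, but with an explicit, informative precondition check."""
    if PACKAGE_NAME is None:
        raise RuntimeError(
            "The project is not configured; call configure_project() or "
            "bootstrap_project() before find_pipelines()."
        )
    return find_pipelines_unguarded()
```

With PACKAGE_NAME left as None, the unguarded version fails with UnboundLocalError, while the guarded version reports what the caller needs to do. Passing the package name as an argument instead of reading a global would remove the hidden precondition altogether.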

astrojuanlu commented 2 days ago

Moving this to our Inbox so that we can look at it and it doesn't get lost.

astrojuanlu commented 2 days ago

IMO global variables can be quite dangerous when used like this, so I would probably advise redesigning this logic to remove the use of globals.

For the record, I agree