kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.95k stars 903 forks source link

Investigate best way to fix and fix MyPy failure on `pipeline_registry.py` in default template #2526

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Description

The default template has this code in pipeline_registry.py:

https://github.com/kedro-org/kedro/blob/7f447330fc4e933ae4f5f364bb38032600b78d85/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/pipeline_registry.py#L14-L15

However, latest MyPy version (1.2.0 at the time of writing) doesn't have a high opinion on that snippet:

src/openrepair/pipeline_registry.py:13: error: Incompatible types in assignment (expression has type "Union[Pipeline, Literal[0]]", target has type "Pipeline")  [assignment]
Found 1 error in 1 file (checked 8 source files)

I tested this on Python 3.10 with the default MyPy settings. I think it's reasonable because sum has a default start=0, therefore if the pipelines list is empty, the result has a different type:

In [5]: pipelines = [Pipeline([]), Pipeline([])]

In [6]: sum(pipelines)
Out[6]: Pipeline([])

In [7]: sum([])
Out[7]: 0

Context

The reason why sum(pipelines) works at all if because of this:

https://github.com/kedro-org/kedro/blob/7f447330fc4e933ae4f5f364bb38032600b78d85/kedro/pipeline/pipeline.py#L184-L186

which naturally allows this nice syntactic sugar by making this work:

In [9]: 0 + Pipeline([])
Out[9]: Pipeline([])

Steps to Reproduce

  1. Create a new Kedro project with the default template
  2. Run mypy on it
  3. Observe error

Expected Result

The project templates should produce statically correct code, and give no MyPy errors.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

astrojuanlu commented 1 year ago

This line makes MyPy happy but it's not as nice:

pipelines["__default__"] = sum(pipelines.values(), start=Pipeline([]))
antonymilne commented 1 year ago

Hmm, this is an annoying one. I'm sure other people must have come across this also since it's pretty common to do sum like this on custom classes without specifying start.

Does it help if we add a type annotation to __radd__?

def __radd__(self, other: Union[Pipeline, Literal[0]])

Definitely I'd prefer to not explicitly specify start=Pipeline([])) - we could just put a mypy ignore in there if required instead?

astrojuanlu commented 1 year ago

I think the problem is that find_pipelines always returns {"__default__": Pipeline([])} even if it finds nothing, but this dynamic behavior is difficult to communicate to MyPy. Even with the __radd__ annotation, that wouldn't help with the mismatch on the register_pipelines annotation.

I think there only ways are either ignoring the error (always feels like surrender 🏳️) or try to assure MyPy that somehow pipelines.values() will never be empty, but the latter necessarily requires code changes. An alternative to start= could be

    pipelines["__default__"] = sum(pipelines.values()) or Pipeline([])

still looks meh but at least one does not need to know about the obscure sum behavior.

deepyaman commented 1 year ago

I feel like overriding the sum return type should work, like so:

from typing import Iterable, overload

 @overload
 def sum(__iterable: Iterable[Pipeline]) -> Iterable[Pipeline]: ...

The idea is to override https://github.com/python/typeshed/blob/dd2818a41db6cd31e4680abf5e1362a7e5bfb5a6/stdlib/builtins.pyi#L1731-L1748.

I need to test this properly, though; whatever I'm changing seems to be magically make mypy --no-incremental . on a new Kedro project pass, even if I undo changes, so that can't be right... (I am able to get the error from this issue when just installing from PyPI)

astrojuanlu commented 11 months ago

This is a mostly harmless code change and would allow us to introduce MyPy as a linter. Is anybody strongly opposed to https://github.com/kedro-org/kedro/issues/2526#issuecomment-1513481730 or https://github.com/kedro-org/kedro/issues/2526#issuecomment-1512738943 ?

antonymilne commented 11 months ago

I am not strongly opposed but I am opposed enough to leave a comment 😀

I think that sum(pipelines.values()) is significantly nicer than the suggestions you linked to. IMO the correct solution here would be to some fix that allows that to pass mypy, e.g. @deepyaman's suggestion https://github.com/kedro-org/kedro/issues/2526#issuecomment-1513632379.

If it's too difficult to get something like that working then changing sum(pipelines.values()) is ok by me, but I do think it's worth investing a bit more time investigating alternatives because your proposed solutions make all users' pipeline_registry.py (which is arguably already too complicated) even harder to understand.

astrojuanlu commented 9 months ago

I'm leaving this to whoever takes the ticket, since all solutions are quite small and doesn't seem to be consensus about that, but it's also quite minor