kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.82k stars, 895 forks

Git fatal error when using a Kedro project inside directory without an initialised repo #1401

Closed oj-m closed 2 years ago

oj-m commented 2 years ago

Description

Kedro Docs contains a pandas Iris example project which has Python 3.6 in the requirements file, which does not execute on newer Apple M1 chipsets. Attempting to execute it on an M1-compatible version of Python 3.8.13 via kedro run results in:

kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
fatal: Needed a single revision

Context

Attempt the docs instructions on https://kedro.readthedocs.io/en/stable/get_started/example_project.html on an Apple M1.

Steps to Reproduce

Execute the following on an Apple M1:

kedro new --starter=pandas-iris
cd pandas-iris
git init
pip install -r src/requirements.txt
kedro run

Expected Result

Run without fatal errors.

Actual Result

kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
fatal: Needed a single revision

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

antonymilne commented 2 years ago

Hi @oj-m and thanks for the bug report. Is there anything else in the error message at all? It's not a very helpful message unfortunately but I suspect this is something to do with git rather than kedro itself. Is the directory you're running kedro in version controlled?

"Kedro Docs contains a pandas Iris example project which has Python 3.6 in the requirements file"

Please could you point out where the Python 3.6 requirement is? I'm a bit surprised to hear this. Kedro 0.17.7 should indeed work with 3.6, 3.7 and 3.8.

datajoely commented 2 years ago

I would also add: does this work if you do not init the repository? That error is from git, not kedro? This Stack Overflow post may help, as Homebrew users are reporting similar issues.

avan-sh commented 2 years ago

Hi @oj-m, I tried to replicate the error on my M1 machine but couldn't reproduce it. This error might be related to some other package instead.

oj-m commented 2 years ago

Ok, I cleaned everything up, skipped git init, and hit a jupyter_client dependency error:

$ kedro new --starter=pandas-iris
2022-04-02 16:05:04,377 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.4
Kedro-Telemetry is installed, but you have opted out of sharing usage analytics so none will be collected.

Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
 [New Kedro Project]:

Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
 [new-kedro-project]:

Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter
or underscore.
 [new_kedro_project]:

Change directory to the project generated in /Users/oj-m/Documents/new-kedro-project

A best-practice setup includes initialising git and creating a virtual environment before running ``kedro install`` to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/
$ cd new-kedro-project
$ kedro install
2022-04-02 16:06:39,161 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.4
As an open-source project, we collect usage analytics.
We cannot see nor store information contained in a Kedro project.
You can find out more by reading our privacy notice:
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice
Do you opt into usage analytics?  [y/N]:
Kedro-Telemetry is installed, but you have opted out of sharing usage analytics so none will be collected.
DeprecationWarning: Command `kedro install` will be deprecated in Kedro 0.18.0. In the future use `pip install -r src/requirements.txt` instead. If you were running `kedro install` with the `--build-reqs` flag, we recommend running `kedro build-reqs` followed by `pip install -r src/requirements.txt`
No requirements.in found. Copying contents from requirements.txt...
/Users/oj-m/.pyenv/versions/3.8.13/envs/kedro/bin/python3.8 -m piptools compile -q /Users/oj-m/Documents/new-kedro-project/src/requirements.in
Could not find a version that matches jupyter_client<7.0,>=4.1,>=5.1,>=5.3.4,>=6.1.12,>=7.0.0 (from -r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 7))
Tried: 4.0.0, 4.0.0, 4.0.0, 4.1.0, 4.1.0, 4.1.1, 4.1.1, 4.1.1, 4.2.0, 4.2.0, 4.2.0, 4.2.1, 4.2.1, 4.2.1, 4.2.2, 4.2.2, 4.2.2, 4.3.0, 4.3.0, 4.3.0, 4.4.0, 4.4.0, 5.0.0, 5.0.0, 5.0.1, 5.0.1, 5.1.0, 5.1.0, 5.2.0, 5.2.0, 5.2.1, 5.2.1, 5.2.2, 5.2.2, 5.2.3, 5.2.3, 5.2.4, 5.2.4, 5.3.0, 5.3.0, 5.3.1, 5.3.1, 5.3.2, 5.3.2, 5.3.3, 5.3.3, 5.3.4, 5.3.4, 5.3.5, 5.3.5, 6.0.0, 6.0.0, 6.1.0, 6.1.0, 6.1.1, 6.1.1, 6.1.2, 6.1.2, 6.1.3, 6.1.3, 6.1.5, 6.1.5, 6.1.6, 6.1.6, 6.1.7, 6.1.7, 6.1.8, 6.1.8, 6.1.9, 6.1.9, 6.1.10, 6.1.10, 6.1.11, 6.1.11, 6.1.12, 6.1.12, 6.1.13, 6.1.13, 6.2.0, 6.2.0, 7.0.0, 7.0.0, 7.0.1, 7.0.1, 7.0.2, 7.0.2, 7.0.3, 7.0.3, 7.0.4, 7.0.4, 7.0.5, 7.0.5, 7.0.6, 7.0.6, 7.1.0, 7.1.0, 7.1.1, 7.1.1, 7.1.2, 7.1.2, 7.2.0, 7.2.0, 7.2.1, 7.2.1
Skipped pre-versions: 7.0.0a0, 7.0.0a0, 7.0.0a1, 7.0.0a1, 7.0.0rc0, 7.0.0rc0, 7.0.0rc1, 7.0.0rc1
There are incompatible versions in the resolved dependencies:
  jupyter_client<7.0,>=5.1 (from -r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 7))
  jupyter-client>=5.3.4 (from notebook==6.4.10->jupyter==1.0.0->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 6))
  jupyter-client>=6.1.12 (from ipykernel==6.11.0->jupyter==1.0.0->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 6))
  jupyter-client>=7.0.0 (from jupyter-console==6.4.3->jupyter==1.0.0->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 6))
  jupyter-client<7.0,>=5.1 (from kedro[pandas.csvdataset]==0.17.7->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 9))
  jupyter-client>=6.1.12 (from jupyter-server==1.16.0->jupyterlab==3.3.2->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 8))
  jupyter-client>=4.1 (from qtconsole==5.3.0->jupyter==1.0.0->-r /Users/oj-m/Documents/new-kedro-project/src/requirements.in (line 6))

avan-sh commented 2 years ago

That is a fair error, and it was fixed in 0.18. As a workaround, add the following line to your requirements file, as mentioned in #1356:

jupyter-console<6.4.3 # 6.4.3 requires jupyter_client>=7.0
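To see why the resolver gives up, the version bounds from the pip-compile output can be checked by hand. The sketch below is a simplified illustration, not how pip-tools resolves dependencies: parse and satisfies are hypothetical helpers that only understand ">=" and "<" on dotted numeric versions, and the candidate list is a small sample of the versions pip-compile reported trying.

```python
from typing import List, Tuple

Version = Tuple[int, ...]


def parse(version: str) -> Version:
    """Turn '6.1.12' into (6, 1, 12) for a simple numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, constraints: List[str]) -> bool:
    """Check a version against constraints of the form '>=X' or '<X' only."""
    v = parse(version)
    for constraint in constraints:
        if constraint.startswith(">="):
            if not v >= parse(constraint[2:]):
                return False
        elif constraint.startswith("<"):
            if not v < parse(constraint[1:]):
                return False
        else:
            raise ValueError(f"unsupported constraint: {constraint}")
    return True


# Bounds copied from the "Could not find a version that matches" line.
constraints = ["<7.0", ">=4.1", ">=5.1", ">=5.3.4", ">=6.1.12", ">=7.0.0"]
# A sample of the versions listed under "Tried:".
candidates = ["5.3.5", "6.1.12", "6.2.0", "7.0.0", "7.2.1"]

# '<7.0' and '>=7.0.0' cannot both hold, so nothing matches.
matching = [c for c in candidates if satisfies(c, constraints)]

# Assumption: pinning jupyter-console<6.4.3 removes the '>=7.0.0' bound,
# which the log attributes to jupyter-console==6.4.3.
relaxed = [c for c in constraints if c != ">=7.0.0"]
matching_relaxed = [c for c in candidates if satisfies(c, relaxed)]

print(matching)
print(matching_relaxed)
```

With the ">=7.0.0" bound gone, the remaining bounds leave room for the 6.1.12 and 6.2.0 releases, which is consistent with the workaround resolving the kedro install step.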

And on your earlier error, it would be great to hear whether you were able to find a fix. If not, which version of git are you using?

oj-m commented 2 years ago

Updating the requirements.txt file with the versions as described above resolved the kedro install step, thanks.

However, the Git issues persist. My installed versions:

Executing kedro run prior to running git init results in:

2022-04-02 17:14:41,321 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.4
Kedro-Telemetry is installed, but you have opted out of sharing usage analytics so none will be collected.
2022-04-02 17:14:41,361 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
fatal: not a git repository (or any of the parent directories): .git
2022-04-02 17:14:41,371 - kedro.framework.session.session - WARNING - Unable to git describe /Users/oj-m/Documents/pandas-iris
...

Executing kedro run after running git init results in:

2022-04-02 17:14:57,575 - kedro.framework.cli.hooks.manager - INFO - Registered CLI hooks from 1 installed plugin(s): kedro-telemetry-0.1.4
Kedro-Telemetry is installed, but you have opted out of sharing usage analytics so none will be collected.
2022-04-02 17:14:57,614 - kedro.framework.session.store - INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
fatal: Needed a single revision
2022-04-02 17:14:57,627 - kedro.framework.session.session - WARNING - Unable to git describe /Users/oj-m/Documents/pandas-iris
...

oj-m commented 2 years ago

Well, looks like it requires a first commit, not just an init...

git add * && git commit -m "First"
kedro run

The rest seems to be working. Thanks.
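The behaviour described above, where git init alone is not enough, can be reproduced in a throwaway directory. This is a minimal sketch and not Kedro code: has_commit and demo are hypothetical helper names, and the check relies on git rev-parse --verify --quiet HEAD, which exits non-zero without printing anything when HEAD does not yet point at a commit.

```python
import shutil
import subprocess
import tempfile
from typing import Optional, Tuple


def has_commit(repo_dir: str) -> bool:
    """Return True if HEAD resolves to a commit in repo_dir."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "--verify", "--quiet", "HEAD"],
            cwd=repo_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (FileNotFoundError, NotADirectoryError):
        # git is not installed, or repo_dir is not a usable directory
        return False
    return completed.returncode == 0


def demo() -> Optional[Tuple[bool, bool]]:
    """Check a throwaway repository before and after its first commit."""
    if shutil.which("git") is None:
        return None  # git unavailable; nothing to demonstrate
    with tempfile.TemporaryDirectory() as repo_dir:
        subprocess.run(["git", "init", "--quiet"], cwd=repo_dir, check=True)
        before = has_commit(repo_dir)
        subprocess.run(
            [
                "git",
                "-c", "user.name=Example",
                "-c", "user.email=example@example.com",
                "-c", "commit.gpgsign=false",
                "commit", "--quiet", "--allow-empty", "-m", "First",
            ],
            cwd=repo_dir,
            check=True,
        )
        after = has_commit(repo_dir)
    return before, after


result = demo()
print(result)
```

When git is available, the first value reflects the state right after git init and the second the state after the first commit.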

oj-m commented 2 years ago

There is one note at the end, however, that doesn't match the expectations from the docs:

...
2022-04-02 17:34:32,058 - kedro.runner.sequential_runner - INFO - Completed 4 out of 4 tasks
2022-04-02 17:34:32,058 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
2022-04-02 17:34:32,058 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.

Assuming it's benign?

avan-sh commented 2 years ago

Yes, that is only an info log. So no trouble with that.

Good that you have a workaround for the git error. I'm able to run the pipeline without any error, so I can't replicate the error to help :(

datajoely commented 2 years ago

Some users are still experiencing issues, so we are still investigating.

avan-sh commented 2 years ago

Adding more information

  1. This is not a blocking error and the fatal error doesn't stop pipeline runs.
  2. This persists on non-M1 machines as well. I tested the following script, which resulted in the same error on both machines:

    import subprocess
    import logging
    from typing import Any, Dict, Iterable, Union
    from pathlib import Path


    def _describe_git(project_path: Path) -> Dict[str, Dict[str, Any]]:
        project_path = str(project_path)
        try:
            res = subprocess.check_output(
                ["git", "rev-parse", "--short", "HEAD"], cwd=project_path
            )
        # `subprocess.check_output()` raises `NotADirectoryError` on Windows
        except (subprocess.CalledProcessError, FileNotFoundError, NotADirectoryError):
            logging.getLogger(__name__).warning("Unable to git describe %s", project_path)
            return {}

        git_data = {"commit_sha": res.decode().strip()}  # type: Dict[str, Any]

        res = subprocess.check_output(["git", "status", "--short"], cwd=project_path)
        git_data["dirty"] = bool(res.decode().strip())
        return {"git": git_data}


    project_path = Path.cwd()

    _describe_git(project_path)

datajoely commented 2 years ago

Okay, so I understand this better now. The exception handler does successfully catch the error on the Python side, but the git subprocess still writes its own, quite scary, error message to the terminal.
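One way to keep the child's message out of the terminal is to redirect its stderr when calling subprocess.check_output(). This is a sketch of that idea under stated assumptions, not a description of what the eventual fix does: short_commit_sha is a hypothetical name, and failing_cmd is a stand-in that mimics git exiting with a non-zero status after writing to stderr, so the example runs without a repository.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence

GIT_HEAD_CMD = ("git", "rev-parse", "--short", "HEAD")


def short_commit_sha(
    project_path: Path, cmd: Sequence[str] = GIT_HEAD_CMD
) -> Optional[str]:
    """Return the command's stripped output, or None if it fails."""
    try:
        res = subprocess.check_output(
            list(cmd),
            cwd=str(project_path),
            # Fold the child's stderr into the captured output so that
            # nothing is written directly to the terminal.
            stderr=subprocess.STDOUT,
        )
    except (subprocess.CalledProcessError, FileNotFoundError, NotADirectoryError):
        return None
    return res.decode().strip()


# Hypothetical stand-in for git in a repository without commits:
# it writes a message to stderr and exits with a non-zero status.
failing_cmd = (
    sys.executable,
    "-c",
    "import sys; print('fatal: Needed a single revision', file=sys.stderr); sys.exit(128)",
)

result = short_commit_sha(Path.cwd(), failing_cmd)
print(result)
```

With stderr redirected, the failure is still caught by the except clause, and the caller decides whether and how to log it. Passing stderr=subprocess.DEVNULL instead would discard the message entirely.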

We could pre-check this by doing a couple of things:

oj-m commented 2 years ago

@datajoely That's what initially gave me pause, as I didn't catch that the fatal error was actually a benign INFO log. Without diving into the codebase, it just wasn't immediately clear the error was bubbling messaging up from an expected git state.

antonymilne commented 2 years ago

Just to understand where we stand on this... This isn't actually related to Apple M1 chips at all, right? It's just what happens if you do kedro run in a directory which hasn't had git commit yet?

datajoely commented 2 years ago

It's not - I've changed the description

datajoely commented 2 years ago

Fix in review: https://github.com/kedro-org/kedro/pull/1422/files