kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Kedro-Airflow not working with Astrocloud #13

Closed yetudada closed 5 months ago

yetudada commented 2 years ago

Raised by @jweiss-ocurate:

Description

I am trying to run a simple spaceflights example with Astrocloud. I wasn't sure if anyone has been able to get it to work.

Here is the Dockerfile:

    FROM quay.io/astronomer/astro-runtime:4.1.0

    RUN pip install --user new_kedro_project-0.1-py3-none-any.whl --ignore-requires-python

Context

I am trying to use kedro-airflow with astrocloud.

Steps to Reproduce

  1. Follow the directions at https://kedro.readthedocs.io/en/latest/10_deployment/11_airflow_astronomer.html
  2. Replace the Dockerfile with the one shown above.

Expected Result

Complete Kedro Run on local Airflow image.

Actual Result

Failure in local Airflow image.

    [2022-02-26, 16:43:26 UTC] {store.py:32} INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
    [2022-02-26, 16:43:26 UTC] {session.py:78} WARNING - Unable to git describe /usr/local/airflow
    [2022-02-26, 16:43:29 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL

Your Environment

Include as many relevant details as possible about the environment in which you experienced the bug:

yetudada commented 2 years ago

From @jacobweiss2305:

  1. Kedro-Airflow plugin version used (pip show kedro-airflow): 0.4.1
  2. Airflow version (airflow --version): > 2.0.0
  3. Kedro version used (pip show kedro or kedro -V): 0.17.7
  4. Python version used (python -V): > 3.9
  5. Operating system and version: Ubuntu Linux 20.04

yetudada commented 2 years ago

From @limdauto:

Hi @jacobweiss2305, please try Python 3.8. Support for Python 3.9 hasn't been released yet.

yetudada commented 2 years ago

From @jacobweiss2305:

Hi @limdauto

Support for Kedro and Python 3.9 is available using pip install kedro --ignore-requires-python (https://github.com/kedro-org/kedro/issues/710)

yetudada commented 2 years ago

From @jweiss-ocurate:

Hi @limdauto

Here are the exact steps I am taking:

Kedro + Airflow + AstronomerCloud

Environment

  1. Python 3.9.6
  2. Ubuntu 20.04
  3. Kedro == 0.17.7
  4. Kedro-Airflow == 0.4.1

Steps

  1. mkdir astro_cloud_kedro
  2. cd astro_cloud_kedro
  3. astrocloud dev init
  4. python -m venv venv && source venv/bin/activate
  5. pip install kedro --ignore-requires-python
  6. pip install kedro-airflow --ignore-requires-python
  7. kedro new --starter=spaceflights
  8. cp -r new-kedro-project/* . && rm -rf new-kedro-project
  9. pip install -r src/requirements.txt --ignore-requires-python
  10. kedro package
  11. Edit the Dockerfile:

      FROM quay.io/astronomer/astro-runtime:4.1.0

      RUN pip install --user src/dist/new_kedro_project-0.1-py3-none-any.whl --ignore-requires-python

  12. kedro airflow create --target-dir=dags/ --env=base
  13. astrocloud dev start

Error

  1. Go to localhost:8080
  2. Activate new-kedro-project dag in Airflow
  3. The first step should fail with the following logs:
*** Failed to verify remote log exists s3:///new-kedro-project/data-processing-preprocess-companies-node/2022-02-28T14:47:01.235178+00:00/1.log.
Please provide a bucket_name instead of "s3:///new-kedro-project/data-processing-preprocess-companies-node/2022-02-28T14:47:01.235178+00:00/1.log"
*** Falling back to local log
*** Reading local file: /usr/local/airflow/logs/new-kedro-project/data-processing-preprocess-companies-node/2022-02-28T14:47:01.235178+00:00/1.log
[2022-02-28, 15:17:11 UTC] {taskinstance.py:1037} INFO - Dependencies all met for <TaskInstance: new-kedro-project.data-processing-preprocess-companies-node scheduled__2022-02-28T14:47:01.235178+00:00 [queued]>
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1037} INFO - Dependencies all met for <TaskInstance: new-kedro-project.data-processing-preprocess-companies-node scheduled__2022-02-28T14:47:01.235178+00:00 [queued]>
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1243} INFO - 
--------------------------------------------------------------------------------
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1244} INFO - Starting attempt 1 of 2
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1245} INFO - 
--------------------------------------------------------------------------------
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1264} INFO - Executing <Task(KedroOperator): data-processing-preprocess-companies-node> on 2022-02-28 14:47:01.235178+00:00
[2022-02-28, 15:17:12 UTC] {standard_task_runner.py:52} INFO - Started process 220 to run task
[2022-02-28, 15:17:12 UTC] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'new-kedro-project', 'data-processing-preprocess-companies-node', 'scheduled__2022-02-28T14:47:01.235178+00:00', '--job-id', '2', '--raw', '--subdir', 'DAGS_FOLDER/new_kedro_project_dag.py', '--cfg-path', '/tmp/tmpmr1pmxmb', '--error-file', '/tmp/tmpqzqs8xs8']
[2022-02-28, 15:17:12 UTC] {standard_task_runner.py:77} INFO - Job 2: Subtask data-processing-preprocess-companies-node
[2022-02-28, 15:17:12 UTC] {logging_mixin.py:109} INFO - Running <TaskInstance: new-kedro-project.data-processing-preprocess-companies-node scheduled__2022-02-28T14:47:01.235178+00:00 [running]> on host 3d8fc15ee46a
[2022-02-28, 15:17:12 UTC] {taskinstance.py:1429} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=new-kedro-project
AIRFLOW_CTX_TASK_ID=data-processing-preprocess-companies-node
AIRFLOW_CTX_EXECUTION_DATE=2022-02-28T14:47:01.235178+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2022-02-28T14:47:01.235178+00:00
[2022-02-28, 15:17:12 UTC] {store.py:32} INFO - `read()` not implemented for `BaseSessionStore`. Assuming empty store.
[2022-02-28, 15:17:12 UTC] {session.py:78} WARNING - Unable to git describe /usr/local/airflow
[2022-02-28, 15:17:12 UTC] {logging_mixin.py:109} WARNING - /home/astro/.local/lib/python3.9/site-packages/kedro/config/config.py:296 UserWarning: Duplicate environment detected! Skipping re-loading from configuration path: /usr/local/airflow/conf/base
[2022-02-28, 15:17:13 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
[2022-02-28, 15:17:13 UTC] {taskinstance.py:1272} INFO - Marking task as UP_FOR_RETRY. dag_id=new-kedro-project, task_id=data-processing-preprocess-companies-node, execution_date=20220228T144701, start_date=20220228T151711, end_date=20220228T151713
[2022-02-28, 15:17:14 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
yetudada commented 2 years ago

From @sunkickr:

@jweiss-ocurate this may be a memory issue based on the task logs showing Negsignal.SIGKILL. Could you try increasing the amount of local memory allocated to docker?

yetudada commented 2 years ago

From @idanov:

@jweiss-ocurate I can confirm we could reproduce that. We'll try to debug what's causing it and update you with any findings we have here.

yetudada commented 2 years ago

From @jweiss-ocurate:

Astronomer worked on this with me. The current Docker image for Astronomer Cloud requires Python 3.9, so I had to install Kedro using --ignore-requires-python.

Astronomer was able to add a quick fix by reinstalling Python 3.7 in the Dockerfile.

yetudada commented 2 years ago

From @noklam:

@jweiss-ocurate Does it work after downgrading the Python version?

yetudada commented 2 years ago

From @jweiss-ocurate:

Yes it does.

noklam commented 2 years ago

I tried to get it running with the develop branch but was not successful.

  1. astrocloud dev start doesn't really allow volume mounting, so I can't install a local copy of Kedro.
  2. There is no git access, and even shipping the entire repo into the Docker image and installing from there seems to be blocked (see the error below).

I wonder if there is anything special about astrocloud, or whether we could just test with a custom Airflow setup to get rid of these restrictions.

I also noticed it uses quay.io/astronomer/astro-runtime instead of the astronomer/ap-airflow image used in the documentation.

#13 0.247 + pip install kedro_develop                                                                                                                                                                                                  
#13 0.589 Defaulting to user installation because normal site-packages is not writeable                                                                                                                                                
#13 0.610 Looking in links: https://pip.astronomer.io/simple/astronomer-fab-security-manager/                                                                                                                                          
#13 0.973 ERROR: Could not find a version that satisfies the requirement kedro_develop (from versions: none)                                                                                                                           
#13 0.973 ERROR: No matching distribution found for kedro_develop
#13 1.274 WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
#13 1.274 You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
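One way around the lack of volume mounting, sketched here purely as an untested assumption (the `COPY` source path and checkout location are hypothetical, not from the thread), would be to bake the local checkout into the image via the Dockerfile instead of installing from PyPI:

    # Hypothetical sketch: since `astrocloud dev start` does not mount volumes,
    # copy a local Kedro checkout (assumed to sit next to the Dockerfile as
    # ./kedro) into the build context and install it from there.
    FROM quay.io/astronomer/astro-runtime:4.2.1
    COPY kedro /tmp/kedro
    RUN pip install --user /tmp/kedro --ignore-requires-python

Whether the build sandbox permits this is exactly what the error above calls into question, so treat it as a suggestion to try rather than a known-good fix.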
noklam commented 2 years ago

@jweiss-ocurate Could you share the latest Dockerfile that runs successfully?

noklam commented 2 years ago

After some investigation, the exact line causing the issue is `logging.config.dictConfig(logging_config)`.

Testing with the latest image + Python 3.9 + Kedro==0.18.0, the following workaround makes it work.

Update this line in logging.yml:

    disable_existing_loggers: True
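For context, a hedged sketch of where that key sits in a project's logging.yml (the surrounding structure is assumed from the Kedro project template and may differ in your project; only the `disable_existing_loggers` value is the actual change):

    version: 1
    disable_existing_loggers: True  # the workaround; the template ships False
    formatters:
      simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    handlers:
      console:
        class: logging.StreamHandler
        level: INFO
        formatter: simple
    root:
      level: INFO
      handlers: [console]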

Dockerfile

    FROM quay.io/astronomer/astro-runtime:4.2.1
    RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl --ignore-requires-python

Minimal example to reproduce the error

A minimal example of KedroOperator.execute() to reproduce the issue. It's not entirely clear what the issue is, but disabling the existing loggers fixes the crash. It is potentially conflicting with Airflow's own logger. We will revisit the way Kedro does logging soon and hopefully fix this issue at the same time.

    # Requires `import logging.config` at the top of the DAG file.
    def execute(self, context):
        print("Hello World")
        config = {
            "version": 1,
            "disable_existing_loggers": False,
            "formatters": {
                "simple": {
                    "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
                },
            },
            "handlers": {
                "console": {
                    "class": "logging.StreamHandler",
                    "level": "INFO",
                    "formatter": "simple",
                    "stream": "ext://sys.stdout",
                },
            },
            # Uncomment this block and the task crashes
            # "root": {
            #     "level": "INFO",
            #     "handlers": ["console"],
            # },
        }

        # With the "root" block commented out above, this call succeeds;
        # with it uncommented, the task is killed.
        logging.config.dictConfig(config)
        print("End of the Program")
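As a side note, the effect of the `disable_existing_loggers` workaround can be seen in isolation with plain Python, no Airflow needed: `dictConfig` marks any logger created before the call, and not named in the config, as disabled. This is standard-library behaviour, not Kedro-specific (the logger names below are illustrative only):

    import logging
    import logging.config

    # Simulate a logger that already exists before configuration is applied,
    # e.g. one created by a host framework such as Airflow.
    pre_existing = logging.getLogger("host.framework")

    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": True,  # the workaround from logging.yml
        "handlers": {
            "console": {"class": "logging.StreamHandler", "level": "INFO"},
        },
        "root": {"level": "INFO", "handlers": ["console"]},
    })

    # Loggers created before dictConfig() and not mentioned in the config
    # are silenced; loggers created afterwards are unaffected.
    print(pre_existing.disabled)                        # True
    print(logging.getLogger("created.after").disabled)  # False

This suggests the workaround simply silences whatever logger Airflow and Kedro were fighting over, rather than resolving the underlying conflict.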
DimedS commented 5 months ago

The Airflow Astronomer and AstroCloud deployment documentation was updated in #3792. Due to issues with the Rich library logging in Airflow deployments, one of the updated steps advises setting Kedro logging to [console] only. Deployments are now successfully working with Astro and other cloud providers.
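For readers landing here now, the console-only setting referred to above looks roughly like this in the project's logging config (a sketch: the handler names follow the Kedro 0.19 default logging.yml, where `rich` is the default root handler; check your own file):

    root:
      level: INFO
      handlers: [console]  # instead of [rich], which causes issues under Airflow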