tomasfarias / airflow-dbt-python

A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
https://airflow-dbt-python.readthedocs.io
MIT License

dbt not writing log to file #73

Closed · gvillafanetapia closed 2 years ago

gvillafanetapia commented 2 years ago

It seems that the dbt.log file is not being written inside the temp directory generated by the run.

Is there any extra config that needs to be done?

Thanks!

tomasfarias commented 2 years ago

Hey @gvillafanetapia

Thanks for reporting an issue.

Could you provide me with a small reproducible example? I'm just looking for some common ground to start debugging. It would also help if you could share your Airflow, Python, dbt, and airflow-dbt-python versions.

I'll have to look into this more in depth, as it could be related to a long-standing bug with logs (see #60). The gist is that, in an attempt to override dbt's logger to output to Airflow, we may have messed things up. The biggest challenge is diving into the library dbt uses for logging (Logbook), as I'm not familiar with it.

gvillafanetapia commented 2 years ago

Hi!

We are using:

The prod implementation uses a dynamic DAG factory, so the code is somewhat complicated, but you can reproduce it with a simple DAG:

import pendulum

from pathlib import Path
from posixpath import join
from airflow.models.dag import dag
from airflow_dbt_python.operators.dbt import DbtRunOperator

# The dbt project ("test") lives in a directory next to this DAG file
root_dir = Path(__file__).absolute().parent.as_posix()
dbt_dir = join(root_dir, "test")

@dag(
    start_date=pendulum.datetime(2022, 8, 30),
    schedule_interval="@hourly",
    catchup=False,
)
def test_dbt():
    op = DbtRunOperator(
        task_id='test_dbt_logs',
        project_dir=dbt_dir,
        profiles_dir=dbt_dir,
        debug=True,
    )

dag = test_dbt()

We implemented a function at the end of execute that sends all logs and artifacts to S3, and that's where we noticed the dbt.log wasn't being generated within the temp directory. I managed to get it to write a file, but it only contains the stdout logs, not the full logs dbt writes to the file when run locally. Still, we might be on the right track:

# Module-level imports needed by this snippet
import logging
import os
from contextlib import contextmanager
from logging.handlers import RotatingFileHandler

from dbt.logger import make_log_dir_if_missing  # or simply os.makedirs(log_path, exist_ok=True)

# Method on our operator subclass:
@contextmanager
def override_dbt_logging(self, dbt_directory: str = None):
    """Override dbt's logger.

    Starting with dbt v1, dbt initializes two loggers: default_file and
    default_stdout. We override default_stdout's handlers with Airflow logger's
    handlers.
    """
    log_path = os.path.join(dbt_directory, "logs")

    make_log_dir_if_missing(log_path)
    log_dest = os.path.join(log_path, "dbt.log")

    file_handler = RotatingFileHandler(
        filename=log_dest,
        encoding="utf8",
        maxBytes=10 * 1024 * 1024,  # 10 MB
        backupCount=5,
    )
    file_passthrough_formatter = logging.Formatter(fmt="%(message)s")
    file_handler.setFormatter(file_passthrough_formatter)
    file_handler.setLevel(logging.DEBUG)  # always debug regardless of user input

    # Attach the rotating file handler to dbt's default_file logger
    file_logger = logging.getLogger("default_file")
    file_logger.handlers = [file_handler]

    # Not defined in the original snippet; Airflow's task logger handlers are the likely intent
    airflow_log_handlers = logging.getLogger("airflow.task").handlers
    stdout_logger = logging.getLogger("default_stdout")
    stdout_logger.setLevel(logging.DEBUG)
    stdout_logger.handlers = airflow_log_handlers

    yield

tomasfarias commented 2 years ago

I believe the issue is just that we are getting rid of dbt's default_file logger handlers. I originally dropped those while investigating a log duplication issue (which that change did not fix, and which is still open, see: #60).

Given that removing those handlers doesn't help with #60, I'll go ahead and drop these lines: https://github.com/tomasfarias/airflow-dbt-python/blob/master/airflow_dbt_python/operators/dbt.py#L333.

I'll also add a few test assertions to ensure we are producing dbt log files. That way, as we continue to iterate on fixes for #60, we won't regress on this issue.

tomasfarias commented 2 years ago

I was not accurate: the issue is caused by us never running the setup_event_logger function (defined here: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/events/functions.py#L54), which in turn creates the logs dir. I believe you were trying to emulate this function in override_dbt_logging.

Your implementation is headed in the right direction, but it ignores the user's log-path project configuration, and we should account for that. I'll see if we can fetch that value and use setup_event_logger ourselves.
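
A rough sketch of that idea, purely illustrative: it assumes setup_event_logger takes the log path as its first argument (as it did in dbt-core around this time), and configure_dbt_file_logging is a made-up helper name, not part of airflow-dbt-python.

import os

from dbt.events.functions import setup_event_logger

def configure_dbt_file_logging(project_dir: str, log_path: str = None) -> str:
    """Ensure dbt writes dbt.log, honoring the project's log-path if configured."""
    # Fall back to <project_dir>/logs, mirroring dbt's default log location
    log_path = log_path or os.path.join(project_dir, "logs")
    # setup_event_logger creates the logs directory and attaches dbt's file logger
    setup_event_logger(log_path)
    return log_path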

gvillafanetapia commented 2 years ago

I believe you were trying to emulate this function in override_dbt_logging

exactly

it ignores the user's input from the log-path project configuration

yes, I hadn't gotten there yet

I'll see if we can fetch that and use setup_event_logger ourselves.

🙌 i'll let you know if I find anything else from my end

tomasfarias commented 2 years ago

I managed to make this work by basically calling setup_event_logger when running the dbt task. Unfortunately, it may not be enough: since we run in a temporary directory, the temporary directory, and the logs directory inside it, are deleted after we exit the dbt_directory context manager.

I guess you could subclass any operator and run your code before exiting the context manager, for example along the lines of the sketch below. How are you planning to save these logs?
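
Something like this, as an illustrative sketch only: it assumes the operator's dbt_directory context manager yields the temporary directory path, and that dbt has written its logs under <tmp_dir>/logs by the time the task finishes.

import os
import shutil
from contextlib import contextmanager

from airflow_dbt_python.operators.dbt import DbtRunOperator

class DbtRunKeepLogsOperator(DbtRunOperator):
    """Hypothetical subclass that copies dbt's logs out before the temp dir is removed."""

    log_destination = "/opt/airflow/dbt_logs"  # made-up destination; an S3 upload would also fit here

    @contextmanager
    def dbt_directory(self):
        with super().dbt_directory() as tmp_dir:
            try:
                yield tmp_dir
            finally:
                # Runs after the dbt task has finished, but before the temporary
                # directory (and the logs inside it) is cleaned up.
                logs_dir = os.path.join(tmp_dir, "logs")
                if os.path.isdir(logs_dir):
                    shutil.copytree(logs_dir, self.log_destination, dirs_exist_ok=True)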

tomasfarias commented 2 years ago

I think using an absolute path as the log-path configuration value may do the trick...

tomasfarias commented 2 years ago

See: https://github.com/tomasfarias/airflow-dbt-python/pull/77. I've added a test case showing that an absolute log-path pointing to a different directory (in this case a pytest temporary directory) keeps the file available after execution.
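
The shape of such a test, as an illustrative sketch rather than the PR's actual code: fixture names like dbt_project_file and profiles_file are assumptions, and the project's dbt_project.yml is assumed to set log-path to an absolute directory under tmp_path.

from airflow_dbt_python.operators.dbt import DbtRunOperator

def test_dbt_run_writes_log_file(tmp_path, dbt_project_file, profiles_file):
    """Running dbt with an absolute log-path should leave dbt.log behind after execution."""
    log_dir = tmp_path / "logs"  # the absolute log-path configured in dbt_project.yml

    op = DbtRunOperator(
        task_id="dbt_run_logs",
        project_dir=str(dbt_project_file.parent),
        profiles_dir=str(profiles_file.parent),
    )
    op.execute({})

    log_file = log_dir / "dbt.log"
    assert log_file.exists()
    assert log_file.stat().st_size > 0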

tomasfarias commented 2 years ago

This makes subclassing a bit easier: instead of overriding dbt_directory, you can just call super().execute() and then use the logs directory, roughly as sketched below. Perhaps airflow-dbt-python should expose some more configuration parameters to control logging, but I'm happy with merging this solution if it works for you (assuming tests are green, and perhaps after adding a couple more tests).
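
For example, an illustrative sketch of that pattern: it assumes dbt_project.yml sets log-path to the absolute LOG_DIR below, and uses the Amazon provider's S3Hook; the bucket, key, and connection id are made up.

import os

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow_dbt_python.operators.dbt import DbtRunOperator

LOG_DIR = "/opt/airflow/dbt_logs"  # absolute log-path configured in dbt_project.yml

class DbtRunWithS3LogsOperator(DbtRunOperator):
    """Hypothetical subclass: run dbt normally, then push dbt.log to S3."""

    def execute(self, context):
        result = super().execute(context)

        log_file = os.path.join(LOG_DIR, "dbt.log")
        if os.path.exists(log_file):
            S3Hook(aws_conn_id="aws_default").load_file(
                filename=log_file,
                key=f"dbt/logs/{context['ds']}/dbt.log",
                bucket_name="my-dbt-artifacts",
                replace=True,
            )
        return result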

gvillafanetapia commented 2 years ago

after we exit the dbt_directory context manager the temporary directory and the logs directory in it are deleted.

Yes, in our implementation we overrode the operator and write the logs and artifacts to S3 before the tmp dir is deleted. We were hoping to make a PR here once the functionality was in production 😌

See: https://github.com/tomasfarias/airflow-dbt-python/pull/77.

awesome 🥳 I'll take a look

tomasfarias commented 2 years ago

Yes, in our implementation we overrode the operator and write the logs and artifacts to S3 before the tmp dir is deleted.

Cool, yeah, I was imagining something like that. PR #77 should now support your use case, and it's almost done. I'm currently iterating to fix a few tests (most of them due to Python 3.7 incompatibilities with newer Python versions; I really want dbt to drop 3.7 :upside_down_face:).

Eventually we may add some additional logging functionality as I still feel we haven't properly merged Airflow logging with dbt logging.

Regardless, I'll have PR #77 done soon and release it with version 0.15 in the next few days.