Hey @gvillafanetapia,
Thanks for reporting an issue.
Could you provide me with a small reproducible example? I'm just looking for some common ground to start debugging. It would also help if you could share your Airflow, Python, dbt, and airflow-dbt-python versions.
I'll have to look into this more in depth, as it could be related to a long-standing bug with logs (see: #60). The gist is that, in an attempt to override dbt's logger to output to Airflow, we may have messed things up. The biggest challenge is diving into the library dbt uses for logging (Logbook), as I'm not familiar with it.
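For reference, Logbook does ship a compatibility handler that forwards its records into stdlib logging, which is roughly the direction needed to merge dbt's output into Airflow's logger. A minimal sketch, illustrative only and not what the operator currently does:

```python
import logging

import logbook
from logbook.compat import LoggingHandler

# Forward Logbook records (emitted by older dbt versions) into the
# Airflow task logger, which uses stdlib logging.
airflow_logger = logging.getLogger("airflow.task")
handler = LoggingHandler(logger=airflow_logger, level=logbook.DEBUG)

with handler.applicationbound():
    ...  # run dbt here; its Logbook output is re-emitted via stdlib logging
```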
Hi!
We are using:
The prod implementation uses a dynamic DAG factory, so the code is somewhat complicated, but you can reproduce it with a simple run:
```python
import pendulum
from pathlib import Path
from posixpath import join

from airflow.models.dag import dag
from airflow_dbt_python.operators.dbt import DbtRunOperator

root_dir = Path(__file__).absolute().parent.as_posix()
dbt_dir = join(root_dir, "test")


@dag(
    start_date=pendulum.datetime(2022, 8, 30),
    schedule_interval="@hourly",
    catchup=False,
)
def test_dbt():
    op = DbtRunOperator(
        task_id="test_dbt_logs",
        project_dir=dbt_dir,
        profiles_dir=dbt_dir,
        debug=True,
    )


dag = test_dbt()
```
We implemented a function at the end of execute that sends all logs and artifacts to S3, and there we noticed that dbt.log wasn't being generated within the temp directory. I managed to get it to write a log file, but it only contains the stdout logs, not the full logs dbt writes to the file locally. Still, we might be on the right track:
```python
import logging
import os
from contextlib import contextmanager
from logging.handlers import RotatingFileHandler
from typing import Optional

from dbt.logger import make_log_dir_if_missing  # dbt-core 1.x


@contextmanager
def override_dbt_logging(self, dbt_directory: Optional[str] = None):
    """Override dbt's loggers.

    Starting with dbt v1, dbt initializes two loggers: default_file and
    default_stdout. We override default_stdout's handlers with the Airflow
    logger's handlers.
    """
    log_path = os.path.join(dbt_directory, "logs")
    make_log_dir_if_missing(log_path)
    log_dest = os.path.join(log_path, "dbt.log")

    file_handler = RotatingFileHandler(
        filename=log_dest,
        encoding="utf8",
        maxBytes=10 * 1024 * 1024,  # 10 MB
        backupCount=5,
    )
    file_passthrough_formatter = logging.Formatter(fmt="%(message)s")
    file_handler.setFormatter(file_passthrough_formatter)
    file_handler.setLevel(logging.DEBUG)  # always debug, regardless of user input

    file_logger = logging.getLogger("default_file")
    file_logger.handlers = [file_handler]

    stdout_logger = logging.getLogger("default_stdout")
    stdout_logger.setLevel(logging.DEBUG)
    # Forward dbt's stdout-style logs through the Airflow task logger's handlers.
    airflow_log_handlers = self.log.handlers
    stdout_logger.handlers = airflow_log_handlers

    yield
```
I believe the issue is just that we are getting rid of dbt's default_file logger handlers. I originally dropped those while investigating a log duplication issue (which that change did not fix; it's still open, see: #60).
Given that removing those handlers doesn't help with #60, I'll go ahead and drop these lines: https://github.com/tomasfarias/airflow-dbt-python/blob/master/airflow_dbt_python/operators/dbt.py#L333.
I'll also add a few test assertions to ensure we are producing dbt log files. That will make sure we don't regress on this issue as we continue to iterate on fixes for #60.
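Something along these lines; a rough sketch where the fixtures (dbt_project_file, profiles_file) are illustrative stand-ins rather than our actual test setup:

```python
from airflow_dbt_python.operators.dbt import DbtRunOperator


def test_dbt_run_produces_log_file(dbt_project_file, profiles_file):
    # dbt_project_file and profiles_file are hypothetical pathlib.Path
    # fixtures pointing at a test project and profiles.yml.
    op = DbtRunOperator(
        task_id="dbt_run",
        project_dir=dbt_project_file.parent,
        profiles_dir=profiles_file.parent,
    )
    op.execute({})  # minimal context, for illustration

    # By default dbt writes dbt.log under <project_dir>/logs.
    log_file = dbt_project_file.parent / "logs" / "dbt.log"
    assert log_file.exists()
    assert log_file.stat().st_size > 0
```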
I was not accurate: the issue is caused by us never running the setup_event_logger function (defined here: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/events/functions.py#L54), which in turn creates the logs dir. I believe you were trying to emulate this function in override_dbt_logging.
The implementation is likely well-oriented, but it ignores the user's input from the log-path project configuration, and we should account for that. I'll see if we can fetch that and use setup_event_logger ourselves.
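Roughly what I have in mind; note the exact signature of setup_event_logger varies across dbt-core versions, so take this as an illustration rather than the final fix:

```python
import os

from dbt.events.functions import setup_event_logger  # dbt-core 1.x


def setup_dbt_file_logging(dbt_directory: str, project_log_path: str = "") -> None:
    # Respect the user's log-path project configuration when it's set,
    # otherwise fall back to a logs/ dir inside the temporary dbt directory.
    log_path = project_log_path or os.path.join(dbt_directory, "logs")
    os.makedirs(log_path, exist_ok=True)
    # In dbt-core 1.x this wires up the default_file/default_stdout loggers
    # and makes dbt write dbt.log under log_path.
    setup_event_logger(log_path)
```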
> I believe you were trying to emulate this function in override_dbt_logging

Exactly.

> it ignores the user's input from the log-path project configuration

Yes, I hadn't gotten there yet.

> I'll see if we can fetch that and use setup_event_logger ourselves.

🙌 I'll let you know if I find anything else from my end.
I managed to make this work by basically calling setup_event_logger when running the dbt task. Unfortunately, it may not be enough: as we are running in a temporary directory, after we exit the dbt_directory context manager, the temporary directory and the logs directory in it are deleted.
I guess you could subclass any operator and run your code before exiting the context manager. How are you planning to save these logs?
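For instance, something like the following; a rough sketch only, assuming dbt_directory is a context manager method that yields the temporary directory path (the actual signature in airflow-dbt-python may differ), and where upload_logs is a hypothetical helper:

```python
from contextlib import contextmanager

from airflow_dbt_python.operators.dbt import DbtRunOperator


class DbtRunArchiveLogsOperator(DbtRunOperator):
    @contextmanager
    def dbt_directory(self):
        with super().dbt_directory() as tmp_dir:
            try:
                yield tmp_dir
            finally:
                # Runs before the parent context manager deletes the
                # temporary directory, so tmp_dir/logs still exists here.
                self.upload_logs(tmp_dir)

    def upload_logs(self, tmp_dir: str) -> None:
        """Hypothetical: copy tmp_dir/logs somewhere persistent (e.g. S3)."""
```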
I think using an absolute path as the log-path configuration value may do the trick...
See: https://github.com/tomasfarias/airflow-dbt-python/pull/77. I've added a test case to show that using an absolute log-path pointing to a different directory (in this case, a pytest temporary directory) preserves the log file for use after execution.
This allows for a bit easier subclassing, as we don't have to override dbt_directory but can just call super().execute() and then use the logs directory. Perhaps airflow-dbt-python should expose some more configuration parameters to control logging, but I'm happy to merge this solution if it works for you (assuming tests are green, and perhaps after adding a couple more tests).
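For example, a subclass could look roughly like this; a minimal sketch, assuming dbt_project.yml sets log-path to an absolute directory that survives the run, with hypothetical S3 bucket/key names:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow_dbt_python.operators.dbt import DbtRunOperator


class DbtRunUploadLogsOperator(DbtRunOperator):
    def execute(self, context):
        result = super().execute(context)
        # dbt_project.yml is assumed to set: log-path: /var/log/dbt
        # Upload the log file after the run; it persists because the
        # log-path points outside the temporary dbt directory.
        S3Hook(aws_conn_id="aws_default").load_file(
            filename="/var/log/dbt/dbt.log",
            key="dbt-logs/dbt.log",
            bucket_name="my-dbt-logs",  # hypothetical bucket
            replace=True,
        )
        return result
```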
> after we exit the dbt_directory context manager the temporary directory and the logs directory in it are deleted.

Yes, in our implementation we overrode the operator and are writing the logs and artifacts to S3 before the tmp dir is deleted. We were hoping to make a PR here once the functionality was in production 😌
> See: https://github.com/tomasfarias/airflow-dbt-python/pull/77.

Awesome 🥳 I'll take a look.
> yes, in our implementation we overrode the operator and are writing the logs and artifacts to S3 before the tmp dir is deleted

Cool, yeah, I was imagining something like that. PR #77 should now support your use case and it's almost done. I'm currently iterating to fix a few tests (most of them due to Python 3.7 incompatibilities with newer versions of Python; I really want dbt to drop 3.7 :upside_down_face:).
Eventually we may add some additional logging functionality as I still feel we haven't properly merged Airflow logging with dbt logging.
Regardless, I'll have PR #77 done soon and release it with version 0.15 in the next few days.
It seems that the dbt.log file is not being written inside the temp directory generated by the run. Is there any extra config that needs to be done?
Thanks!