tomasfarias / airflow-dbt-python

A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
https://airflow-dbt-python.readthedocs.io
MIT License

Unable to use git remote hook with url.scheme in ("git+ssh", "ssh") #121

Open jrmidkiff opened 1 year ago

jrmidkiff commented 1 year ago

Hello, this relates to the discussion I created.

I am unable to use DbtRunOperator with a private git repo accessed over ssh. I am not sure my syntax is correct, but the error I am encountering leads me to believe that the problem is not my usage of the operator.

We are running:

from airflow_dbt_python.operators.dbt import DbtRunOperator

dbt_run = DbtRunOperator(
    dbt_conn_id="dbt-projects-github",  # Airflow connection to private dbt-airflow github repository
    task_id="dbt_run",
    project_dir="git+ssh://github.com/OrganizationName/dbt-airflow",
    # project_conn_id=db_conn,
    select=["+tag:daily"],
    exclude=["tag:deprecated"],
    target="db_conn",  # Airflow Connection to data warehouse
    # profile="my-project",
)

which results in the following error:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/dbt.py", line 325, in dbt_directory
    store_profiles_dir,
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/dbt.py", line 369, in prepare_directory
    tmp_dir,
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/dbt.py", line 182, in download_dbt_project
    return remote.download_dbt_project(project_dir, destination)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/remote.py", line 73, in download_dbt_project
    self.download(source_url, destination_url)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/git.py", line 154, in download
    client, path = self.get_git_client_path(source)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/git.py", line 187, in get_git_client_path
    path = f"{url.netloc.split(':')[1]}/{str(url.path)}"
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/operators/dbt.py", line 173, in execute
    **vars(self),
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/dbt.py", line 234, in run_dbt_task
    env_vars=env_vars,
  File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_dbt_python/hooks/dbt.py", line 330, in dbt_directory
    ) from e
airflow.exceptions.AirflowException: Failed to prepare temporary directory for dbt execution

The url.netloc is github.com, which contains no ":", so url.netloc.split(':') returns a single-element list and indexing it with [1] raises the IndexError. Notably, if we had passed project_dir a GitHub repo URL that used git or http/https, the hook would have run path = str(url.path) instead of path = f"{url.netloc.split(':')[1]}/{str(url.path)}", and the latter line appears to be the cause of the error.
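
Here is a minimal sketch that reproduces the failing expression outside of Airflow. It assumes the hook's url object exposes netloc and path the same way the standard library's urllib.parse.urlparse does, which I have not verified against the package. The split_netloc_path helper is hypothetical: it is not part of airflow-dbt-python and only illustrates one way to avoid indexing past the end of the list.

```python
from urllib.parse import urlparse


def reproduces_index_error(url_string):
    """Return True if the expression from the traceback raises IndexError."""
    url = urlparse(url_string)
    try:
        # Same expression as git.py line 187 in the traceback above.
        url.netloc.split(":")[1]
    except IndexError:
        return True
    return False


def split_netloc_path(url_string):
    """Hypothetical helper (not the library's API): split without assuming ':' in netloc.

    Caveat: a numeric port such as 'github.com:22' would be treated as a
    path segment here, so this is only an illustration, not a proposed fix.
    """
    url = urlparse(url_string)
    # partition never raises: sep is '' when there is no ':' in netloc.
    host, sep, first_segment = url.netloc.partition(":")
    if sep:
        path = f"{first_segment}/{url.path.lstrip('/')}"
    else:
        path = url.path.lstrip("/")
    return host, path


if __name__ == "__main__":
    print(reproduces_index_error("git+ssh://github.com/OrganizationName/dbt-airflow"))
    print(split_netloc_path("git+ssh://github.com/OrganizationName/dbt-airflow"))
    print(split_netloc_path("ssh://git@github.com:OrganizationName/dbt-airflow"))
```

With the URL from my operator call, the first function reports the IndexError, while a URL that places a ":" after the host (the form the hook's expression seems to expect) does not trigger it.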

Are you able to provide any assistance with this? While we work through these errors, it would also be great to receive some feedback on the discussion I opened about this topic.

Thank you very much!