quantile-development / dagster-meltano

A Dagster plugin that allows you to run Meltano in Dagster
MIT License

How to execute meltano prod runs with dagster? #44

Closed ReneTC closed 9 months ago

ReneTC commented 9 months ago

I am really sorry for spamming this repo. I see huge potential in it and I am already quite invested.

I've been running into a problem, and I am interested to hear whether someone has already solved it.

You can easily make a job run in Dagster with this repo by putting this in meltano.yml:

- name: raw_data_to_duckdb
  tasks:
  - tap-spreadsheets-anywhere target-duckdb

However, if you need to run this in prod, you should add the flag --environment=prod, so:

- name: raw_data_to_duckdb_prod
  tasks:
  - --environment=prod  tap-spreadsheets-anywhere target-duckdb

But running meltano invoke dagster:start results in an error: dagster._core.errors.DagsterInvalidDefinitionError: "__environment=prod__tap_spreadsheets_anywhere_target_duckdb" is not a valid name in Dagster. Names must be in regex ^[A-Za-z0-9_]+$

Any ideas?

ReneTC commented 9 months ago

Trying with meltano_run_op() gives the same issue:

from dagster import repository, job
from dagster_meltano import meltano_resource, meltano_run_op

@job(resource_defs={"meltano": meltano_resource})
def meltano_run_job():
    tap_done = meltano_run_op("-environment=prod  tap-1 target-1")()
    meltano_run_op("-environment=prod  tap-2 target-2")(tap_done)

@repository()
def repository():
    return [meltano_run_job]

This gives the same error.

ReneTC commented 9 months ago

Seems to me this could be fixed by changing the Dagster name here

Just remove every character that is not allowed by the regex ^[A-Za-z0-9_]+$ from the name, but keep the executed command separate from the Dagster name so that the command itself is not altered.
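A minimal sketch of that idea, assuming a hypothetical helper called to_dagster_name (an illustrative name, not the function the package actually uses): the Dagster name is derived from the command by replacing disallowed characters, while the original command string is left untouched for execution.

```python
import re

# Pattern quoted in the Dagster error message above.
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def to_dagster_name(command: str) -> str:
    """Hypothetical helper: derive a valid Dagster name from a Meltano command."""
    # Collapse every run of disallowed characters into a single underscore.
    name = re.sub(r"[^A-Za-z0-9_]+", "_", command.strip())
    # Drop the underscores left over from leading/trailing symbols such as "--".
    return name.strip("_")


# The command is kept as-is for execution; only the name is sanitized.
command = "--environment=prod  tap-spreadsheets-anywhere target-duckdb"
dagster_name = to_dagster_name(command)
print(dagster_name)
```

Note that this only addresses the naming error; it does not change where the flag ends up in the executed command.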

JulesHuisman commented 9 months ago

This could either be fixed here: https://github.com/quantile-development/dagster-meltano/blob/1b3022cbd687c65ccd9288f767397efcd2e587ca/dagster_meltano/utils.py#L15-L19 by also replacing the =.

But it might be easier to set the MELTANO_ENVIRONMENT to prod.

ReneTC commented 9 months ago

Would you like me to fix it, test it, and send an MR? (It might not be done until tomorrow.) For me, replacing the = works best. But I am not sure which direction you want to go as the package owner.

JulesHuisman commented 9 months ago

Would be great! I will see the PR appear.

ReneTC commented 9 months ago

Draft here: https://github.com/quantile-development/dagster-meltano/pull/45 I was not able to test it; I was confused about how Meltano installs this package.

I know you can add custom GitHub URLs (e.g. my fork, for testing) to a package like so:

  - name: dagster
    variant: quantile-development
    pip_url: dagster-ext git+https://github.com/my_fork.git
    config:
      repository_dir: ${MELTANO_PROJECT_ROOT}/orchestrate

But I am not sure where to swap out the main dagster-meltano package for a custom git URL.

ReneTC commented 9 months ago

Okay, after https://github.com/quantile-development/dagster-meltano/pull/47 was merged, it sadly does not work yet. If I have the prod task in meltano.yml:

- name: task1
  tasks:
  - tap-spreadsheets-anywhere target-duckdb
- name: task1_prod
  tasks:
  - tap-spreadsheets-anywhere target-duckdb --environment=prod

When dagster-meltano runs, it will execute: meltano run tap-spreadsheets-anywhere target-duckdb --environment=prod but that is wrong; it returns the error: Error: No such option: --environment

The correct syntax is meltano --environment=prod run tap-spreadsheets-anywhere target-duckdb but I don't see how that is possible with the package here. I've asked in the Meltano Slack how to execute a Dagster run in another env here.
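To make the ordering concrete, here is a rough sketch in plain Python of how such an invocation would have to be assembled. The helper build_meltano_argv is hypothetical and not part of dagster-meltano; it only illustrates that --environment is a global option that goes before the run subcommand.

```python
from __future__ import annotations

import shlex


def build_meltano_argv(command: str, environment: str) -> list[str]:
    """Hypothetical helper: put the global --environment option before `run`."""
    # shlex.split tolerates repeated spaces between the tap and the target.
    return ["meltano", f"--environment={environment}", "run", *shlex.split(command)]


argv = build_meltano_argv("tap-spreadsheets-anywhere target-duckdb", "prod")
print(shlex.join(argv))
# prints: meltano --environment=prod run tap-spreadsheets-anywhere target-duckdb
```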

JulesHuisman commented 9 months ago

You should use the MELTANO_ENVIRONMENT variable to specify which environment to use.

ReneTC commented 9 months ago

Thanks Jules but I don't see how to use MELTANO_ENVIRONMENT in this example. Do you mind providing an example?

JulesHuisman commented 9 months ago

For example, we deploy Meltano using a Docker container. In the Docker container we set:

ENV MELTANO_ENVIRONMENT=prod

That way we run meltano in production in our production environment.
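Outside Docker, the same variable can be set per process instead of image-wide. A rough sketch, assuming the Meltano command is launched from Python with subprocess; meltano_env and run_in_environment are illustrative names and not part of dagster-meltano.

```python
import os
import shlex
import subprocess


def meltano_env(environment: str) -> dict:
    """Copy the current process environment and pin MELTANO_ENVIRONMENT in the copy."""
    env = dict(os.environ)
    env["MELTANO_ENVIRONMENT"] = environment
    return env


def run_in_environment(command: str, environment: str) -> subprocess.CompletedProcess:
    """Run `meltano run <command>` with MELTANO_ENVIRONMENT set for this call only."""
    argv = ["meltano", "run", *shlex.split(command)]
    return subprocess.run(argv, env=meltano_env(environment), check=True)


# Only the copied mapping changes; the parent process environment is untouched.
prod_env = meltano_env("prod")
print(prod_env["MELTANO_ENVIRONMENT"])
```

From a shell, exporting MELTANO_ENVIRONMENT=prod before starting the process would be the equivalent.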

ReneTC commented 9 months ago

Thanks for your specific example, @JulesHuisman, I appreciate that. However, we are not using a Docker container, so that solution does not fix the issue.

I found one kinda-working solution. If you run: meltano --environment=prod invoke dagster:start all of the jobs will be executed as prod. Not ideal, because if you want to run Dagster with --environment=dev next time, the Dagster logs do not distinguish between the two, so execution time, number of failures and so on are very confusing to read in the Dagster UI.

I wouldn't mark this as closed, at least for my case. A possible solution for me could be a meltano-dagster operator that also accepts the env as input, i.e. something like:

    return meltano_command_op_with_env(
        command=f"--environment={env} run {command} --force", dagster_name=dagster_name
    )

But I am not sure that is the right direction to go.