mitodl / ol-data-platform

Pipeline definitions for managing data flows to power analytics at MIT Open Learning
BSD 3-Clause "New" or "Revised" License

Standardize ELT steps as a Dagster pipeline #232

Closed: blarghmatey closed this issue 1 year ago

blarghmatey commented 2 years ago

A large fraction of our pipelines will follow a standard pattern of:

- syncing source data via an Airbyte connection
- running an AWS Glue crawler to update the table catalog
- running a dbt job to transform the loaded data

We need modularized ops that capture each of the above steps so that we can easily parameterize them as a job that accepts a definition of which connection to sync and which dbt job to run.

There are existing Dagster integrations for interfacing with Airbyte and dbt. We have also created an initial library for interfacing with Glue, though it will likely need some updates.
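For illustration, here is a minimal sketch of how those modular ops could be wired together using the dagster-airbyte and dagster-dbt integrations. The `glue_client` resource and `run_glue_crawler` op are hypothetical placeholders for our Glue library, not existing APIs:

```python
import boto3
from dagster import In, Nothing, job, op, resource
from dagster_airbyte import airbyte_resource, airbyte_sync_op
from dagster_dbt import dbt_cli_resource, dbt_run_op


@resource
def glue_client(_init_context):
    # Plain boto3 Glue client exposed as a Dagster resource.
    return boto3.client("glue")


@op(
    config_schema={"crawler_name": str},
    ins={"start_after": In(Nothing)},
    required_resource_keys={"glue"},
)
def run_glue_crawler(context):
    # Hypothetical op: kick off the crawler that catalogs the synced data.
    context.resources.glue.start_crawler(Name=context.op_config["crawler_name"])


@job(
    resource_defs={
        "airbyte": airbyte_resource,
        "dbt": dbt_cli_resource,
        "glue": glue_client,
    }
)
def standard_elt():
    # sync -> crawl -> transform, ordered via Nothing dependencies
    dbt_run_op(start_after=run_glue_crawler(start_after=airbyte_sync_op()))
```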

shaidar commented 2 years ago

Wrote the code to handle basic creation of an AWS Glue crawler using the API. Ran into an issue trying to communicate with the Airbyte API; ended up torching my Docker setup and rerunning it, which seemed to fix things, so I'm now writing the code to trigger the Airbyte sync.
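For reference, the crawler creation itself comes down to a single boto3 call; the crawler name, role ARN, database, and S3 path below are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# All identifiers here are illustrative placeholders.
glue.create_crawler(
    Name="ol-data-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="ol_data_raw",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
)
```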

shaidar commented 2 years ago

Initially went down the wrong path of writing Python code from scratch to handle crawler creation, and started working on the Airbyte bit as I assumed we would be calling a Python script through Dagster. Then I realized there are integrations in Dagster to handle Airbyte. Set up a test DB, S3 bucket, local Airbyte, and local Dagster, and went through some of the Dagster tutorials to get familiar with the tool. Ran an Airbyte sync, which seems to work as expected as long as the passed connection ID is correct.
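The integration in question is dagster-airbyte; a working sync boils down to pointing the `airbyte_resource` at the local instance and handing the sync op a valid connection ID. The host, port, and UUID below are placeholders:

```python
from dagster import job
from dagster_airbyte import airbyte_resource, airbyte_sync_op

# Local docker-compose Airbyte; host and port are placeholders.
local_airbyte = airbyte_resource.configured({"host": "localhost", "port": "8000"})

# The connection ID comes from the Airbyte UI and must match an existing connection.
sync_test_db = airbyte_sync_op.configured(
    {"connection_id": "00000000-0000-0000-0000-000000000000"},
    name="sync_test_db",
)


@job(resource_defs={"airbyte": local_airbyte})
def airbyte_sync_job():
    sync_test_db()
```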

shaidar commented 2 years ago

Ran into some roadblocks and am trying to figure out how to work through them in Dagster. Conceptually, Dagster feels very similar to Concourse, and unfortunately its documentation is similarly sparse.

Apart from hashing out those issues, I have three separate ops working as expected, and the focus now is on tying them together.

shaidar commented 2 years ago

Resolved the issue where the Airbyte op was passing despite the sync failing in the UI. The sync in test is still failing around schema validation, so I'll take a quick look at getting that resolved; the next step is to run through all three ops (sync Airbyte, set up the Glue crawler, and trigger dbt) and verify that they all go through. Once that works (should be today), I just need to templatize this as a Dagster pipeline so that we can pass in the Airbyte connection_id and the dbt model and it will set up and trigger a pipeline.
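A rough sketch of that templatization, assuming we build jobs with a factory function; `build_elt_job` and all of its arguments are hypothetical names, and the dbt model selection is passed through the CLI resource's `run` method:

```python
from dagster import In, Nothing, job, op
from dagster_airbyte import airbyte_resource, airbyte_sync_op
from dagster_dbt import dbt_cli_resource


def build_elt_job(name, connection_id, dbt_models):
    """Hypothetical factory: one ELT job per Airbyte connection / dbt selection."""
    sync = airbyte_sync_op.configured(
        {"connection_id": connection_id}, name=f"{name}_airbyte_sync"
    )

    @op(
        name=f"{name}_dbt_run",
        ins={"start_after": In(Nothing)},
        required_resource_keys={"dbt"},
    )
    def run_models(context):
        # The dbt CLI resource accepts a model selection list on run().
        context.resources.dbt.run(models=dbt_models)

    @job(
        name=name,
        resource_defs={"airbyte": airbyte_resource, "dbt": dbt_cli_resource},
    )
    def _elt_job():
        run_models(start_after=sync())

    return _elt_job


# Placeholder connection ID and model selector.
users_elt = build_elt_job(
    "users_elt",
    connection_id="00000000-0000-0000-0000-000000000000",
    dbt_models=["staging_users"],
)
```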

shaidar commented 2 years ago

Ran into dependency hell with dagster-dbt. Running dbt in Dagster throws the following error:

2022-08-11 07:26:48 -0500 - dagster - INFO - airbyte_job - 7d79d806-cac6-48cb-9224-4dcc5367f05a - run_dbt -     import jinja2._compat
2022-08-11 07:26:48 -0500 - dagster - INFO - airbyte_job - 7d79d806-cac6-48cb-9224-4dcc5367f05a - run_dbt - ModuleNotFoundError: No module named 'jinja2._compat'
2022-08-11 07:26:48 -0500 - dagster - INFO - airbyte_job - 7d79d806-cac6-48cb-9224-4dcc5367f05a - run_dbt - dbt exited with return code 1

It appears that dagster-dbt installs a very old version of dbt-core (0.14.2), while the latest version is 1.2.0. Spent a decent amount of time playing around with dependencies to get things resolved, but haven't been successful so far.
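A quick way to confirm what the resolver actually installed (standard library only):

```python
from importlib.metadata import version

# Confirm which versions the dependency resolver actually installed.
for pkg in ("dagster-dbt", "dbt-core"):
    print(pkg, version(pkg))
```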

shaidar commented 2 years ago

Resolved the dependency issue by installing dbt-trino, which ended up pulling in a newer version of dbt-core. Nothing in the docs, guides, or tutorials mentions anything about that. Next was figuring out how to run dbt deps using the dbt CLI resource. Again, the documentation isn't very clear; I ended up reaching out through their Slack support channel and was able to get dbt to run.
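For anyone hitting the same wall: with `dbt_cli_resource` configured with a `project_dir`, my reading of the resource's API is that its generic `cli()` passthrough can run commands like `deps` from inside an op, so treat this as a sketch rather than the definitive usage:

```python
from dagster import job, op
from dagster_dbt import dbt_cli_resource


@op(required_resource_keys={"dbt"})
def run_dbt_deps(context):
    # Generic passthrough on the dbt CLI resource; runs `dbt deps`
    # in the configured project_dir before any models are built.
    context.resources.dbt.cli("deps")


@job(
    resource_defs={
        # project_dir is a placeholder path.
        "dbt": dbt_cli_resource.configured({"project_dir": "/path/to/dbt/project"})
    }
)
def dbt_deps_job():
    run_dbt_deps()
```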

shaidar commented 2 years ago

Started looking into how to wait for the crawler to complete its run before moving on to the dbt run. The crawler API is a bit limited; initially I was going to trigger the crawler and then keep retrying (it throws an exception if you try to start the crawler while it's already running) until it no longer throws an error. Looking some more, though, there's an API that reports the status of the crawler once we pass its name, so we can keep polling until its status is no longer RUNNING.

Had a chat with Tobias about it, and after some online digging his recommendation was to cut the crawler out of the picture and look into AWS Lake Formation, which would potentially simplify things as it also integrates as a destination with Airbyte. So I'm now looking into Lake Formation to see how practical it is to use instead of the Glue crawler.
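For reference, the polling approach described above comes down to roughly the following with boto3 (the function name and poll interval are illustrative):

```python
import time

import boto3

glue = boto3.client("glue")


def wait_for_crawler(name, poll_interval=30):
    """Poll GetCrawler until the named crawler is no longer running."""
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state not in ("RUNNING", "STOPPING"):
            return state  # READY once the run has fully finished
        time.sleep(poll_interval)
```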