Closed dwinston closed 3 years ago
@dwinston I think having a sequence diagram like the one in your example could be really helpful. I haven't used Apache Airflow. I have used Apache NiFi, though. My experience with it was okay. A bit of a learning curve.
After trying Airflow, I think its story is a little shaky at this point due to v2 being pretty new. Their docs for a dockerized setup had a few bugs that I needed to work around.
I'm a bit more bullish at the moment about Dagster, which among other features has official support for including Jupyter notebooks with data dependencies in ETL pipelines.
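To make "data dependencies" concrete, here is a rough sketch using only the Python standard library, not Dagster's API. The step names (`extract`, `validate`, `notebook_report`, `load`) are hypothetical and just stand in for whatever the real pipeline would contain. The idea is that each step declares what it depends on, and the orchestrator derives a valid run order from that.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a step, each value is the set of steps
# whose output it needs. "notebook_report" stands in for a notebook-based step.
STEPS = {
    "extract": set(),
    "validate": {"extract"},
    "notebook_report": {"validate"},
    "load": {"validate"},
}


def execution_order(steps):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    print(execution_order(STEPS))
```

An orchestrator like Dagster or Airflow does considerably more (scheduling, retries, a web UI), but this is the dependency-ordering part in miniature.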
@jbeezley tagging you on this thread to think about what an ETL pipeline running on Spin could look like, one that gets new data / data changes to show on the pilot site without much intervention from you, me, or @wdduncan.
The only technical constraint that I know of is that you can only (currently) expose http/https services externally. So as long as Dagster communicates through http, it should be possible.
Superseded by #316
I am thinking it could be good to set up a NERSC Spin service to run metadata ETL with the help of Apache Airflow and the Python Great Expectations library. I don't have much operational experience with these tools, but they seem like a good fit for NMDC needs and for migrating notebook-based workflows to something that is a bit more systematic and also observable (i.e. browser-based monitoring and reports), while still being Python-based.
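As a rough illustration of the kind of step I have in mind, here is a minimal extract/validate/load sketch in plain Python. It does not use the Airflow or Great Expectations APIs; the `Expectation` class, the `run_etl` function, and the field names `id` and `depth_m` are all hypothetical and only meant to show the shape: declarative checks run against the extracted records, and nothing is loaded if any check fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expectation:
    """A named check applied to a single metadata record (hypothetical helper)."""
    name: str
    check: Callable[[dict], bool]


def validate(records, expectations):
    """Return (record_index, expectation_name) for every failed check."""
    failures = []
    for i, record in enumerate(records):
        for exp in expectations:
            if not exp.check(record):
                failures.append((i, exp.name))
    return failures


def run_etl(extract, expectations, load):
    """Extract records, validate them, and load only if every check passes."""
    records = extract()
    failures = validate(records, expectations)
    if failures:
        # Skip the load and report what failed so it can be surfaced in monitoring.
        return {"loaded": 0, "failures": failures}
    load(records)
    return {"loaded": len(records), "failures": []}


# Hypothetical expectations; the field names are assumptions for illustration.
EXPECTATIONS = [
    Expectation("id_present", lambda r: bool(r.get("id"))),
    Expectation("depth_non_negative", lambda r: r.get("depth_m", 0) >= 0),
]

if __name__ == "__main__":
    sink = []
    report = run_etl(lambda: [{"id": "a", "depth_m": 1.0}], EXPECTATIONS, sink.extend)
    print(report)
```

The returned report is the kind of thing a browser-based monitoring view could display per run.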
Below is my sketch for a Spin service. I asked for feedback on the nerscusers Slack spin channel. @wdduncan @cmungall @dehays let me know your thoughts.
FYI I generated the above using the Mermaid tool: