microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

scheduled, observable ETL with validation reports? #267

Closed dwinston closed 3 years ago

dwinston commented 3 years ago

I am thinking it could be good to set up a NERSC Spin service to run metadata ETL with the help of Apache Airflow and the Python Great Expectations library. I don't have much operational experience with these tools, but they seem like a good fit for NMDC needs and for migrating notebook-based workflows to something that is a bit more systematic and also observable (i.e. browser-based monitoring and reports), while still being Python-based.
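To make the "validation reports" idea concrete, here is a minimal stdlib-only sketch of the validate-and-report pattern that Great Expectations formalizes: each expectation yields a pass/fail result with an unexpected-record count, and a run rolls them up into a report that a tool like Great Expectations would render as HTML. The function names and record fields below are hypothetical, not the Great Expectations API.

```python
# Hypothetical validate-and-report pattern (stdlib only).
# Each "expectation" returns a result dict; validate() aggregates them
# into a report, analogous to a Great Expectations validation run.

def expect_not_null(records, field):
    failures = [r for r in records if r.get(field) is None]
    return {"expectation": f"{field} is not null",
            "success": not failures, "unexpected_count": len(failures)}

def expect_in_set(records, field, allowed):
    failures = [r for r in records if r.get(field) not in allowed]
    return {"expectation": f"{field} in {sorted(allowed)}",
            "success": not failures, "unexpected_count": len(failures)}

def validate(records):
    results = [
        expect_not_null(records, "id"),
        expect_in_set(records, "type", {"biosample", "study"}),
    ]
    return {"success": all(r["success"] for r in results),
            "results": results}

records = [{"id": "nmdc:1", "type": "biosample"},
           {"id": None, "type": "study"}]
report = validate(records)
print(report["success"])  # False: one record has a null id
```

The point is that validation failures become structured, observable data rather than stack traces buried in a notebook.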

Below is my sketch for a Spin service. I asked for feedback on the spin channel of the nerscusers Slack. @wdduncan @cmungall @dehays let me know your thoughts.

[diagram: spin-airflow-mongo-cfs-ge]

FYI I generated the above using the Mermaid tool:

```mermaid
sequenceDiagram
    participant CFS as NERSC CFS
    participant NFS as Spin NFS
    participant Air as Airflow
    participant Py as Worker<br/>(same container<br/>as Airflow?)
    participant Mongo as MongoDB
    participant GE as GreatExpectations
    Air->>NFS: get scheduled workflow
    NFS->>Air: flow.py from NFS volume
    Note over Air: Serve monitoring UI
    Air->>+Py: run flow.py
    Py->>Mongo: connect to db
    loop For each subtask
      Py->>Mongo: get from db
      Py->>CFS: get from CFS
      Py->>Py: execute logic
      Py->>GE: run validations
      Note over GE: Serve HTML report
      opt Save
        Py->>Mongo: write to db
      end
    end
    Py->>Mongo: write to db
    Py->>-Air: done
```
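The worker loop in the diagram could be sketched as a plain Python function, with the MongoDB, CFS, and Great Expectations calls stubbed out behind dicts and callables (all names and the subtask shape here are assumptions for illustration, not an implementation):

```python
# Hypothetical sketch of the diagram's worker loop (stdlib only).
# db and cfs stand in for MongoDB and the NERSC CFS filesystem;
# validate stands in for a Great Expectations run.

def run_flow(subtasks, db, cfs, validate):
    """For each subtask: fetch, transform, validate, conditionally save."""
    for task in subtasks:
        doc = db.get(task["collection"], [])   # Py->>Mongo: get from db
        raw = cfs.get(task["path"], [])        # Py->>CFS: get from CFS
        result = task["logic"](doc, raw)       # Py->>Py: execute logic
        report = validate(result)              # Py->>GE: run validations
        if report["success"]:                  # opt Save
            db[task["collection"]] = result    # Py->>Mongo: write to db
    db["run_log"] = {"status": "done"}         # final write before signaling Airflow
    return db

db = {"biosamples": [{"id": "nmdc:1"}]}
cfs = {"/global/cfs/new.json": [{"id": "nmdc:2"}]}
subtasks = [{"collection": "biosamples",
             "path": "/global/cfs/new.json",
             "logic": lambda old, new: old + new}]
validate = lambda recs: {"success": all(r.get("id") for r in recs)}

out = run_flow(subtasks, db, cfs, validate)
print(len(out["biosamples"]))  # 2
```

Gating the write on the validation report is what makes the pipeline safe to run unattended on a schedule.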
wdduncan commented 3 years ago

@dwinston I think having a sequence diagram like the one you've given as an example could be really helpful. I haven't used Apache Airflow. I have used Apache NiFi, though. My experience with it was okay; a bit of a learning curve.

dwinston commented 3 years ago

After trying Airflow, I think its story is a little shaky at this point due to v2 being pretty new. Their docs for a dockerized setup had a few bugs that I needed to work around.

I'm a bit more bullish at the moment about Dagster, which among other features has official support for including Jupyter notebooks with data dependencies in ETL pipelines.

@jbeezley tagging you on this thread to think about what an ETL pipeline running on Spin could look like, one that gets new data / data changes to show on the pilot site without much intervention from you or me or @wdduncan .

jbeezley commented 3 years ago

The only technical constraint that I know of is that you can only (currently) expose http/https services externally. So as long as Dagster communicates through http, it should be possible.

dwinston commented 3 years ago

Superseded by #316