stackabletech / airflow-operator

Stackable Operator for Apache Airflow
Other
22 stars 2 forks source link

Load DAGs per git-sync #177

Closed adwk67 closed 1 year ago

adwk67 commented 2 years ago

Issue #150 introduces one possibility for general management of external resources: for DAGs, the recommendation from Airflow is to use git-sync (see e.g. here for more info). This issue covers implementing git-sync in a container to regularly keep DAGs updated.

An example of this is given here. The airflow CRD can be changed to include an optional section shown below this (in airflow the roles are expected to have the same config, so the git-sync definition can be at top-level. This might need to be revisited if git-sync is a feature useful for other operators, such as nifi workflows etc.).

There are quite a number of parameters that can be set (see here and the sections that follow the link). Analagous to the SparkHistoryServerSpec the proposal is to define mandatory fields and expose the rest via a map;

...
  credentialsSecret: test-airflow-credentials
  gitSync:
    #---------------------------------------------
    # necessary and mandatory
    #---------------------------------------------
    name: git-sync
    repo: https://github.com/kubernetes/git-sync
    dagsDirectory: dags
    wait: 60
    #---------------------------------------------
    #  optional
    #---------------------------------------------
    gitSyncConf:
      - # depth: default 1
      ...
  webservers:
    roleGroups:
...

A git-sync container can only sync a single repo, so multiple repos would require a container each. A git-sync block may occur at top-level (as would be the case for e.g. airflow), or at role level, if the product only requires external git resources for a specific component (i.e. the init-container will only be created for that role).

Acceptance critiera

See https://github.com/stackabletech/docker-images/pull/337

fhennig commented 1 year ago

My understanding is that #150 needs to be implemented together with this, to allow for multiple different DAG loading mechanisms, is that right? @adwk67

Or is it possible to implement the git-sync mechanism without breaking changes?

Another question about my understanding: The operator would get some git links, and clone the repos when starting up, presumably into an emptyDir. Is that right? No provisioning of volumes by the user required.

adwk67 commented 1 year ago

My understanding is that #150 needs to be implemented together with this, to allow for multiple different DAG loading mechanisms, is that right? @adwk67 Or is it possible to implement the git-sync mechanism without breaking changes? Another question about my understanding: The operator would get some git links, and clone the repos when starting up, presumably into an emptyDir. Is that right? No provisioning of volumes by the user required.

lfrancke commented 1 year ago

Are you two in agreement now? If so I'm happy to move it on

sbernauer commented 1 year ago

Just throwing in the idea of including all the stuff from k8s.gcr.io/git-sync/git-sync:v3.2.2 in our Airflow Image. (We can e.g. extract it out of the upstream image) We were able to do so in all other products AFAIK, eliminating the need to specify any image for the user. Another benefit is not overseeing the image when it comes to building arm images

adwk67 commented 1 year ago

Just throwing in the idea of including all the stuff from k8s.gcr.io/git-sync/git-sync:v3.2.2 in our Airflow Image

I like this idea and will aim to incorporate it.

fhennig commented 1 year ago

I agree

backstreetkiwi commented 1 year ago

Does that mean I can finally remove the node labels? (see unticked box in "acceptance criteria")? paging @adwk67

sbernauer commented 1 year ago

Just a side note: I we still rely on any nodeSelectors in any test for any operator we should probably use affinity instead - which is now supported :)

adwk67 commented 1 year ago

Does that mean I can finally remove the node labels? (see unticked box in "acceptance criteria")?

Yes

lfrancke commented 1 year ago

I just browsed through the docs and see that the "PersistentVolumeClaim" has gone missing. Has that feature been removed?

adwk67 commented 1 year ago

I just browsed through the docs and see that the "PersistentVolumeClaim" has gone missing. Has that feature been removed?

Yes, it was problematic and was one of the reasons we wanted to have git-sync instead.

lfrancke commented 1 year ago

In that case though that removal needs to be documented in the changelog

adwk67 commented 1 year ago

The functionality is still there - we only removed the example/documentation. I've updated the changelog to reflect this in a separate PR.

lfrancke commented 1 year ago

Okay, I'm not sure I do understand why we remove the docs entirely then? Do we want to remove the functionality in the future?

adwk67 commented 1 year ago

Using PVCs for this kind of customer-facing thing is problematic as it depends on what type of PVC-access (read-write-many etc.) is available on a given cloud and if RWX is not available (as is the case with Ionos, for example) then node selection needs to be used to make sure this approach works. I don't think we should give the impression that we recommend or support it. And we didn't add any functionality for this: we just documented how it could be done.