kestra-io / plugin-git

Apache License 2.0
3 stars 4 forks

Add synced files as outputs to the Git Sync task #47

Closed: kriko closed this issue 8 months ago

kriko commented 9 months ago

Feature description

plugin.git should produce structured outputs for changes. In more detail:

io.kestra.plugin.git.Clone outputs should include a list of the files cloned from the repository. Having that list available would make it easier to iterate over the cloned files.

io.kestra.plugin.git.Push outputs should include a list of changes made to the destination repository, in as much detail as possible, e.g. files added, files changed, files removed. For GitOps sync jobs, this would make it possible to monitor and alert when files have been changed locally.

io.kestra.plugin.git.Sync outputs should include a list of changes made to the local instance when syncing files from the Git repository. Currently git.Sync only displays changes in its logged output, not in a machine-readable format. Similarly to git.Push, it should list files in as much detail as possible, e.g. files added, files changed, files removed.

I don't have a good implementation to suggest, but in general it should be possible to detect which files were added, deleted, or changed. Any additional Git output would be nice too.
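As a rough illustration of that detection idea, independent of any Git library: given a mapping of file path to content hash before and after a sync, the three categories fall out of plain set operations. The function name `diff_snapshots` and the snapshot format are assumptions made for this sketch; nothing here is taken from plugin-git.

```python
def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: content_hash} snapshots (hypothetical format)."""
    before_paths, after_paths = set(before), set(after)
    return {
        # Present only after the sync.
        "additions": sorted(after_paths - before_paths),
        # Present only before the sync.
        "deletions": sorted(before_paths - after_paths),
        # Present in both snapshots, but with a different content hash.
        "changes": sorted(
            p for p in before_paths & after_paths if before[p] != after[p]
        ),
    }
```

Unchanged files appear in none of the three lists, so an empty result means the sync was a no-op.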

I found some examples that use the Git CLI to produce structured JSON output. They do not go into much detail on changed files, but they do show additional metadata in the structured output: https://gist.github.com/varemenos/e95c2e098e657c7688fd https://www.reddit.com/r/git/comments/udatkb/convert_your_git_log_output_to_json_with_the_new/
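For the per-file detail those examples lack, one option might be to parse the output of `git diff --name-status`, which prints a status letter and a tab-separated path per line. The sketch below only parses text that has already been captured. The function name `parse_name_status` and the output keys are assumptions for illustration, and statuses other than `A`, `M`, and `D` (renames, for example) are skipped.

```python
import json

# Status letters used by `git diff --name-status` for added, modified, deleted.
STATUS_KEYS = {"A": "additions", "M": "changes", "D": "deletions"}


def parse_name_status(text: str) -> dict[str, list[str]]:
    """Group paths from name-status output by kind of change."""
    result = {"additions": [], "changes": [], "deletions": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        key = STATUS_KEYS.get(status[:1])
        if key and paths:
            # Other statuses (e.g. renames like R100) are ignored in this sketch.
            result[key].append(paths[-1])
    return result


# Example input in the format described above (made-up file names).
raw = "A\tscript2.py\nM\tscript3.py\nD\tscript1.py\n"
print(json.dumps(parse_name_status(raw), indent=2))
```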

I did a bit of digging in the plugin.git code, and it seems that the underlying library used for Git operations (org.eclipse.jgit) has support for Git status, which could perhaps be used: https://javadoc.io/static/org.eclipse.jgit/org.eclipse.jgit/6.8.0.202311291450-r/org.eclipse.jgit/org/eclipse/jgit/api/Status.html

This was briefly discussed in Slack as well.

anna-geller commented 9 months ago

Could you say more about why it's important to capture that information in the outputs?

Generally, we tend to include in the outputs data that is meant to:

  1. be a downloadable artifact, e.g. extracted data made available as a report for business users
  2. pass data to downstream tasks

Git diff in Git tasks is meant for troubleshooting. That's why it's captured in logs.

I'd close the issue unless you can provide a bit more context for why this is needed in the output.

If the issue is about improving the structure of logs in some way, we are definitely open to that 👍

kriko commented 9 months ago

The reasons for getting outputs from the Git tasks are:

  1. In a generic Git clone use case: iterating over the cloned files in the next steps of the flow, then taking action on those files.
  2. For git Push: knowing which files were changed locally compared to the destination repository, which makes it possible to either trigger alerts or do advanced logging.
  3. For git Sync: knowing which flows and files were synced from Git, and which local flows and files were removed or overwritten, gives a more fine-grained ability to do alerting or monitoring.

anna-geller commented 9 months ago

as discussed via Slack:

For 1, it's already possible! You would clone the repo within the Working Directory task and in the next step you can iterate over files in any script task
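A minimal sketch of what such a script step could look like, assuming the script runs with the cloned repository as its current directory. The helper name `list_cloned_files` is hypothetical, and skipping the `.git` directory is a choice made for this example.

```python
from pathlib import Path


def list_cloned_files(root: str = ".") -> list[str]:
    """Return relative paths of all files under root, skipping Git metadata."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        # Keep regular files only and ignore anything inside a .git directory.
        if p.is_file() and ".git" not in p.relative_to(base).parts
    )
```

Calling `list_cloned_files()` in the script task returns the list to loop over; each entry is a path relative to the working directory.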

For 2, Git Push wouldn't solve the problem. We plan to add an Audit Log Trigger giving you fine-grained control over when to trigger alerts (e.g. when specific flows from a specific namespace were changed or deleted).

For 3, a Git webhook trigger would be better; afaik you can't use them in your organization, right? In that case, we can consider adding outputs to git Sync only at first to help with that use case. I'll rename the issue.

anna-geller commented 9 months ago

a good implementation might be a map of outputs with the following keys:

{
  "namespaceFiles": {
    "deletions": [
      "script1.py"
    ],
    "additions": [
      "script2.py"
    ],
    "changes": [
      "script3.py"
    ]
  },
  "flows": {
    "deletions": [
      "_flows/flow1.yml"
    ],
    "additions": [
      "_flows/flow2.yml"
    ],
    "changes": [
      "_flows/flow3.yml"
    ]
  }
}

so that they can be accessed e.g. via {{ outputs.mygitsync.namespaceFiles.additions }}, which will pass ["script2.py"]
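To make the proposed shape concrete, here is a hedged sketch that groups a flat set of changed paths into that map. It assumes flows can be recognized by the `_flows/` prefix seen in the example paths. That rule, the function name `build_sync_outputs`, and the input format are all assumptions for illustration, not how the task is implemented.

```python
def build_sync_outputs(
    changes: dict[str, list[str]], flows_prefix: str = "_flows/"
) -> dict[str, dict[str, list[str]]]:
    """Split changed paths into the proposed namespaceFiles / flows groups."""
    kinds = ("deletions", "additions", "changes")
    outputs = {
        group: {kind: [] for kind in kinds}
        for group in ("namespaceFiles", "flows")
    }
    for kind in kinds:
        for path in changes.get(kind, []):
            # Assumption: anything under the flows prefix is a flow definition.
            group = "flows" if path.startswith(flows_prefix) else "namespaceFiles"
            outputs[group][kind].append(path)
    return outputs
```

With the file names from the JSON above as input, `build_sync_outputs(...)["namespaceFiles"]["additions"]` evaluates to `["script2.py"]`, mirroring the expression in the comment.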

anna-geller commented 8 months ago

The issue will be addressed in the new tasks PushFlows and SyncFlows 🎉 You can follow the progress and updates here https://github.com/kestra-io/plugin-git/issues/56