kestra-io / plugin-gcp

Apache License 2.0

Add support for Regex pattern in `outputFiles` in GoogleBatchTaskRunner #362

Closed by anna-geller 4 months ago

anna-geller commented 7 months ago

Feature description

Example using the pattern `*.json` in `outputFiles`:

id: python_etl
namespace: dev

tasks:
  - id: extract
    type: io.kestra.plugin.fs.http.Download
    uri: https://dummyjson.com/products

  - id: transform
    type: io.kestra.plugin.scripts.python.Script
    docker:
      image: python:3.11-alpine
    inputFiles:
      data.json: "{{ outputs.extract.uri }}"
    outputFiles:
      - "*.json"
    env:
      COLUMNS_TO_KEEP: "{{ inputs.columns_to_keep }}"
    script: |
      import json

      columns_to_keep = ["brand", "price"]

      with open("data.json", "r") as file:
          data = json.load(file)

      filtered_data = [
          {column: product[column] for column in columns_to_keep}
          for product in data["products"]
      ]
      with open("products.json", "w") as file:
          json.dump(filtered_data, file, indent=4)

loicmathieu commented 4 months ago

A more contrived example:

id: outputFileRegex
namespace: company.team

tasks:
  - id: taskRunner
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.txt"
    commands:
      - echo -n 'File 1' > {{workingDir}}/file1.txt
      - echo -n 'File 2' > {{workingDir}}/file2.txt
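
Although the issue title says "Regex", the patterns in both examples (`*.json`, `*.txt`) read as shell-style globs. The sketch below only illustrates which file names such a pattern would select. It is not the plugin's implementation: `select_output_files` and the `run.log` file are hypothetical names made up for illustration, and the matching rules Kestra actually applies may differ.

```python
from fnmatch import fnmatchcase


def select_output_files(produced, patterns):
    """Return the produced file names that match at least one glob pattern.

    Hypothetical helper for illustration only; it does not mirror Kestra's code.
    """
    return [
        name
        for name in produced
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Files assumed to exist in the working directory after the commands run.
# "run.log" is an invented extra file to show a non-matching name.
produced = ["file1.txt", "file2.txt", "run.log"]

print(select_output_files(produced, ["*.txt"]))  # ['file1.txt', 'file2.txt']
```

Under these assumptions, `"*.txt"` captures both files written by the shell commands and skips everything else, which is the behavior the flow above is meant to exercise.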