kestra-io / plugin-fs

https://kestra.io/plugins/plugin-fs/
Apache License 2.0

Allow options to chunk logging of ssh outputs #141

Closed johnkm516 closed 2 months ago

johnkm516 commented 3 months ago

Feature description

I am using an SSH command to run a JAR file which runs some batch logic on a remote server. The output of the command is very verbose, with lots of data. Currently, the plugin outputs one log record per line of shell output, which results in up to 30,000 database records being saved in Postgres per execution. I want to save these logs, but not so inefficiently that every curly brace or square bracket in a JSON output becomes its own log record. A workaround is to first save the output into a variable and echo it, like this:

tasks:
  - id: start_batch
    type: io.kestra.plugin.fs.ssh.Command
    commands:
      - cd /app/instances/cm-bat
      - output=$(./springBatch.sh <batch jar with print output>)
      - exit_code=$?
      - echo $output
      - exit $exit_code
    host: "{{ render(namespace.batch.dev.host) }}"  
    password: "{{ render(namespace.batch.dev.password) }}"
    username: "{{ render(namespace.batch.dev.username) }}"

But then the log is unformatted, and looks like this in the Kestra output:

[screenshot: the echoed output rendered as a single unformatted log line in the Kestra UI]

I wish there was an option in the task to be able to chunk the log output while keeping all the formatting.
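For illustration, here is what such an option might look like on the task. The `logChunkSize` property below is hypothetical — it is the option being requested, not something that exists in the plugin today:

```yaml
tasks:
  - id: start_batch
    type: io.kestra.plugin.fs.ssh.Command
    logChunkSize: 1000   # hypothetical: buffer up to 1000 output lines per log record
    commands:
      - cd /app/instances/cm-bat
      - ./springBatch.sh <batch jar with print output>
    host: "{{ render(namespace.batch.dev.host) }}"
    password: "{{ render(namespace.batch.dev.password) }}"
    username: "{{ render(namespace.batch.dev.username) }}"
```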

anna-geller commented 2 months ago

Hi John, I struggle to understand the issue. Do you want to capture outputs from a Shell task without generating a log?

Can you share a small example that allows us to reproduce the issue you see?

Keep in mind that you can set a custom logLevel, e.g. logLevel: WARN, to avoid storing INFO logs like the ones shown above.
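A sketch of that suggestion applied to the task from the original report (logLevel is a generic task property in Kestra, so it should apply to the SSH Command task as well):

```yaml
tasks:
  - id: start_batch
    type: io.kestra.plugin.fs.ssh.Command
    logLevel: WARN   # INFO-level stdout lines are not stored as log records
    commands:
      - cd /app/instances/cm-bat
      - ./springBatch.sh messageMakeJob
    host: "{{ render(namespace.batch.dev.host) }}"
    password: "{{ render(namespace.batch.dev.password) }}"
    username: "{{ render(namespace.batch.dev.username) }}"
```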

johnkm516 commented 2 months ago

> Hi John, I struggle to understand the issue. Do you want to capture outputs from a Shell task without generating a log?
>
> Can you share a small example that allows us to reproduce the issue you see?
>
> Keep in mind that you can set a custom logLevel, e.g. logLevel: WARN, to avoid storing such INFO logs as shown above.

Hi @anna-geller

Sorry, I don't think I explained myself very well. I do want to generate a log, but I want to be able to chunk the log output so that each log record contains a number of output lines (set by an option on the SSH task), rather than Kestra generating one log record per line of output. Here is an example of what I'm talking about:

id: batch-cm
namespace: erp.cm

tasks:
  - id: start_batch
    type: io.kestra.plugin.fs.ssh.Command
    commands:
      - cd /app/instances/cm-bat
      - ./springBatch.sh messageMakeJob

    host: "{{ render(namespace.batch.host) }}"  
    password: "{{ render(namespace.batch.password) }}"
    username: "{{ render(namespace.batch.username) }}"

  - id: Log
    type: io.kestra.plugin.core.log.Log
    message: '{{ outputs.start_batch.vars.batchlog }}'
concurrency:
  limit: 1
  behavior: CANCEL

./springBatch.sh executes a JAR file which outputs random JSON, about 20,000 lines in total. When Kestra executes the above flow, the SSH command takes about 2 minutes 14 seconds, because it creates a log record line by line, like this:

[screenshot: the Kestra UI showing one log record per line of output]

If I modify the commands to first store the output in a variable and echo it, like below:

id: batch-cm
namespace: erp.cm

tasks:
  - id: start_batch
    type: io.kestra.plugin.fs.ssh.Command
    commands:
      - cd /app/instances/cm-bat
      - output=$(./springBatch.sh messageMakeJob)
      - echo $output

    host: "{{ render(namespace.batch.host) }}"  
    password: "{{ render(namespace.batch.password) }}"
    username: "{{ render(namespace.batch.username) }}"

  - id: Log
    type: io.kestra.plugin.core.log.Log
    message: '{{ outputs.start_batch.vars.batchlog }}'
concurrency:
  limit: 1
  behavior: CANCEL

The task takes significantly less time to execute, completing in around 6 seconds, but the output is not formatted:

[screenshot: the output collapsed into a single unformatted log line]

For performance and storage reasons, creating a new log record in the DB for each line of shell output is an issue. Each log record in Kestra carries metadata such as timestamp, level, thread, etc., all of which is replicated 20,000 times in my flow above, once per line of output. When the output is prettified JSON, Kestra ends up storing all that metadata for a line containing nothing but a curly brace.

I would like an option to output the log in batches rather than line by line. For example, if I set the option to 1000, the worker would first read 1000 lines of shell output into memory before saving them as a single log record in the DB.

Edit:

I just thought of a much better solution: an option to continuously stream the shell output into a log file, rather than creating log records in the Kestra DB at all.
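Pending such an option, the idea can already be approximated inside the task's commands: redirect everything to a file on the remote host and surface only a short summary as Kestra log lines. A runnable sketch, with `seq` standing in for `./springBatch.sh` and `/tmp/batch.log` as an illustrative path:

```shell
# Redirect all output (stdout and stderr) of the verbose command to a file
# on the remote host instead of emitting it line by line to Kestra's logs.
log=/tmp/batch.log
seq 1 20000 > "$log" 2>&1
exit_code=$?

# Only these few lines become Kestra log records.
echo "batch finished with exit code $exit_code ($(wc -l < "$log") lines in $log)"
tail -n 3 "$log"
```

This keeps the full output available on the remote host while the execution stores only a handful of log records.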

anna-geller commented 2 months ago

Thanks for the additional context. We don't have much control over a Shell script to manipulate which logs are stored in memory and which are sent to Kestra - we capture all stdout and stderr. For Python/Node.js tasks, we plan to add a dedicated Kestra logger to give more control over what gets logged and at which log level: https://github.com/kestra-io/kestra/issues/2451

For shell scripts, your idea to redirect shell output to a log file seems like a great solution! If you need more control over logging, you could consider building a custom Java plugin instead of using Shell (given that the logic you are running is a Java application?). This would give you maximum control over what gets logged and how.

johnkm516 commented 2 months ago

> thanks for the additional context. We don't have much control over a Shell script to manipulate which logs are stored in memory and which are sent to kestra - we capture all stdout and stderr. For Python/Node.js tasks, we plan to add a dedicated kestra logger to give more control over what gets logged and at which log level kestra-io/kestra#2451
>
> for shell scripts, your idea to redirect shell output to a log file seems like a great solution! if you need more control over logging, you could consider building a custom Java plugin instead of using Shell (given the logic you are running is a Java application?), this would give you maximum control over what gets logged and how

I'm not exactly wanting to manipulate which logs are stored in memory and which are sent to Kestra - I want to capture all stdout and stderr as-is, but control when Kestra writes the log. For example, if I have a BufferedReader over stdout and stderr like this:

Runtime rt = Runtime.getRuntime();
String[] commands = {"system.exe", "-get t"};
Process proc = rt.exec(commands);

BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));

// Read the output from the command
String s;
while ((s = stdInput.readLine()) != null) {
    System.out.println(s);
    // Kestra would insert a log record into the database here, line by line
}

// Read any errors from the attempted command
while ((s = stdError.readLine()) != null) {
    System.out.println(s);
}

Instead of the above implementation, I'm proposing something like the following:

Runtime rt = Runtime.getRuntime();
String[] commands = {"system.exe", "-get t"};
Process proc = rt.exec(commands);

BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));

int chunkSize = 1000; //the chunkSize would be an input attribute in the ssh activity

// Read the output from the command in chunks
StringBuilder outputBuffer = new StringBuilder();
String s;
int lineCount = 0;
while ((s = stdInput.readLine()) != null) {
    outputBuffer.append(s).append(System.lineSeparator());
    lineCount++;
    if (lineCount >= chunkSize) {
        processChunk(outputBuffer.toString()); //Kestra would insert a log record into the database here, the log chunked by 1000 lines
        outputBuffer.setLength(0); // Clear the buffer
        lineCount = 0;
    }
}
// Process any remaining lines
if (outputBuffer.length() > 0) {
    processChunk(outputBuffer.toString());
}

// Read any errors from the attempted command
while ((s = stdError.readLine()) != null) {
    System.out.println(s);
}

Additionally, an option to store the log in Kestra's internal storage, like I suggested earlier, would be great as well. The JAR file already stores logs locally, but there doesn't seem to be a way for me to download that log into Kestra's internal storage via SSH. I want to be able to see all the logs for the batch executions directly within Kestra's UI.
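Since the JAR already writes a local log file, the SFTP Download task from this same plugin might cover pulling it into Kestra's internal storage. A sketch, appended after the start_batch task — the remote path is an assumption, and the exact property names should be checked against the plugin docs:

```yaml
  - id: fetch_batch_log
    type: io.kestra.plugin.fs.sftp.Download
    host: "{{ render(namespace.batch.host) }}"
    username: "{{ render(namespace.batch.username) }}"
    password: "{{ render(namespace.batch.password) }}"
    from: /app/instances/cm-bat/logs/batch.log   # assumed location of the JAR's local log file
```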

anna-geller commented 2 months ago

Hi John, big thanks for adding more context. I let Yann and Paul from the team cross-check your request to validate our understanding of the problem. It seems that optimizing the speed of writing logs by switching to bulk/async writes instead of writing logs in real time will address your issue. You can follow the progress of that work here: https://github.com/kestra-io/kestra/issues/988

Feel free to close this issue and follow along on the linked one if #988 addresses your needs.

johnkm516 commented 2 months ago

> Hi John, big thanks for adding more context. I let Yann and Paul from the team cross-check your request to validate our understanding of the problem. It seems that optimizing the speed of writing logs by switching to bulk/async instead of writing logs in real-time will address your issue. You can follow the progress of that work here kestra-io/kestra#988
>
> feel free to close this issue and follow along on the linked issue if 988 will address your needs

#988 does address my needs, but in addition to bulk/async writes, I think an option to store logs in Kestra's internal storage would be super useful for script tasks (Python scripts, shell scripts, etc.) whose output we know will be very large (100,000+ lines). For such cases, offering a download button for the log instead of rendering it would put much less stress on the database and Kestra's webserver.

I'll leave the issue open in case you have any other comments, otherwise please close the issue. Thank you very much for the prompt responses.

anna-geller commented 2 months ago

I opened a separate issue for that request here https://github.com/kestra-io/kestra/issues/4688

Thanks so much for clarifying, John!