treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

InvalidParameterException occurred only when executing sh operator on ECSCommandExecutor. #1749

Closed rariyama closed 1 year ago

rariyama commented 2 years ago

Thanks for this great library. I have a question about ECSCommandExecutor, so please let me know if there are effective solutions. This happened on digdag 0.10.4 and OpenJDK 1.8.0_312.

I wrote a workflow like the following and executed it with digdag run spot.dig -l debug.

# spot.dig
_export:
  !include : "config.dig" # This file contains environment variables.

+Task:
  +Py:
    _export:
      ecs:
        task_definition_arn: "arn:aws:ecs:ap-northeast-1:123456789:task-definition/py"
    py>: task.Sample.run

  +Sh:
    _export:
      ecs:
        task_definition_arn: "arn:aws:ecs:ap-northeast-1:123456789:task-definition/embulk"
    sh>: embulk run Sample.yml

  +Rb:
    _export:
      ecs:
        task_definition_arn: "arn:aws:ecs:ap-northeast-1:123456789:task-definition/rb"
    rb>: task.Sample.run

## config.dig
### the total size of these variables is larger than 8192 bytes
key1: value1
key2: value2
key3: value3
...

When I executed this, Py and Rb succeeded, but Sh fell back and logged the following error:

io.digdag.standards.command.EcsCommandExecutor: Fall back to DockerCommandExecutor: Task failed during the task submission Container Overrides length must be at most 8192 (Service: AmazonECS; Status Code: 400; Error Code: InvalidParameterException;

I understand that this error results from the size of the variables, but I wonder why it occurred only with the sh>: operator. I would expect py>: and rb>: to fail for the same reason. Could you explain why this happens, and suggest effective ways to solve it if there are any?
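
For reference, the same limit can be reproduced outside digdag with the AWS CLI (the cluster and container names below are placeholders, not taken from this issue): a containerOverrides payload whose environment entries exceed 8192 characters makes RunTask return the same InvalidParameterException.

# generate roughly 9 KB of filler for a single environment value
big_value=$(head -c 9000 /dev/zero | tr '\0' 'x')

aws ecs run-task \
  --cluster my-cluster \
  --task-definition arn:aws:ecs:ap-northeast-1:123456789:task-definition/embulk \
  --overrides "{\"containerOverrides\":[{\"name\":\"embulk\",\"environment\":[{\"name\":\"KEY1\",\"value\":\"$big_value\"}]}]}"

# expected output:
# An error occurred (InvalidParameterException) when calling the RunTask operation:
# Container Overrides length must be at most 8192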

rariyama commented 2 years ago

This PR is probably related to this issue. https://github.com/treasure-data/digdag/pull/1624

yoyama commented 2 years ago

I am not very familiar with it, but each operator passes information that is specific to that operator. As a result, the length of the Container Override differs between operators.

rariyama commented 2 years ago

I'm sorry, I overlooked that this PR was not merged into master. I understand that the length of the Container Override is not the cause of this problem. I'll look into the differences between the sh operator and the other ones.

rariyama commented 1 year ago

I found the reason for this problem. I'll describe it and a possible solution below.

The reason why ECS Run Task fails

ShOperator passes the variables defined in the workflow to the environment object, and if the size of this environment object exceeds 8 KB, ECS Run Task can fail. On the other hand, PyOperator and RbOperator pass them via input.json, so ECS Run Task doesn't fail even if their total size is larger than 8 KB.
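
As a rough illustration of why only the sh>: case hits the limit, the following sketch (a hypothetical helper, not part of digdag; it assumes a plain key=value params file) estimates how large the containerOverrides environment would get if every variable were passed as an ECS environment override:

#!/bin/sh
# Hypothetical helper: estimate the size of the containerOverrides "environment"
# that would be generated if every key=value line in a params file were passed
# as an ECS environment override (the way sh> passes workflow variables).
PARAMS_FILE="${1:-params.env}"

total=0
while IFS='=' read -r key value; do
  [ -n "$key" ] || continue
  # each pair contributes roughly {"name":"<key>","value":"<value>"} to the JSON
  total=$(( total + ${#key} + ${#value} + 22 ))
done < "$PARAMS_FILE"

echo "estimated overrides environment size: ${total} bytes (ECS limit: 8192)"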

The possible solution

The possible solution is to make the implementation of ShOperator closer to the others. One idea is to write the variables into runner.sh instead of the environment object and export them when ECS Run Task executes. Currently only the shell command is written to runner.sh, but if the variables are written there as well, the shell command can be executed both locally and on ECS. Here is an example of runner.sh after this change:


export KEY1="VALUE1"
export KEY2="VALUE2"
export KEY3="VALUE3"

echo $KEY1 # the command

If there are no concerns about this idea, I'll create a PR. Please tell me your opinions or questions about it if you have any.

hiroyuki-sato commented 1 year ago

Hello, @rariyama

In my opinion (I'm not a maintainer of this project), I don't completely understand this change, but it seems that it would need to escape some characters very carefully.

For example, wouldn't the following remove some files when this parameter is set? Is this correct?

export FOO="$( rm -rf /some/path/to )"

rariyama commented 1 year ago

Hello @hiroyuki-sato

Thanks for your reply. You are right to point that out. Even though it might be possible to escape some characters, it looks too risky and complex to implement, so I don't think this option is a good one.
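
For reference, the kind of quoting that would be needed to make such values inert is the standard single-quote escaping shown below; this is only an illustration of the complexity, not something digdag does today.

# value containing a command substitution, taken from the example above
value='$( rm -rf /some/path/to )'

# escape any embedded single quotes, then wrap the whole value in single quotes
escaped=$(printf '%s' "$value" | sed "s/'/'\\\\''/g")
printf "export FOO='%s'\n" "$escaped"

# prints: export FOO='$( rm -rf /some/path/to )'
# sourcing that line assigns the literal text to FOO; nothing is executed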

As for ShOperator, parameters are currently set as environment variables, but it would be better to implement this another way, because the environment values are passed to the ECS Container Override environment, and if their total length exceeds 8 KB, the ECS task fails due to the specification of ECS TaskOverride.

For example, PyOperator passes parameters via input.json, and runner.py reads that file and executes the process. In this way, the Python process can run without passing parameters through the ECS Container Override environment.

It would be better to make the implementation of ShOperator closer to PyOperator or RbOperator if possible, but unfortunately I haven't come up with any promising idea. If you have any knowledge about this, please tell me :pray:

hiroyuki-sato commented 1 year ago

Hello, @rariyama

I think STDIN is a candidate. It would need a shell script for parsing STDIN, and it is not compatible with the current operator, so I think it is better to develop a custom operator.
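
As a rough sketch of what such a script could look like (all names here are hypothetical, and it assumes simple key=value lines rather than digdag's actual params format):

#!/bin/sh
# run_with_params.sh (hypothetical): read key=value parameters from STDIN,
# export them, then run the command given as arguments.
#
# usage: cat params.env | ./run_with_params.sh embulk run Sample.yml

while IFS='=' read -r key value; do
  [ -n "$key" ] || continue
  export "$key=$value"
done

exec "$@"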

This plugin may help. digdag-plugin-shresult.

You can find community operators.

rariyama commented 1 year ago

Hi @hiroyuki-sato, thank you for your suggestion. Please let me confirm my understanding: it would be effective to create a shell script that receives a parameter file via STDIN and parses it, and in that case it is better to contribute a change to the existing operator or create a new one, because the current operator is not compatible with this approach. Is that right?

hiroyuki-sato commented 1 year ago

Hello, @rariyama

Yes, that is what my previous post means.

I understand that if the sh> operator supported environment variables over 8 KB, it would be very useful. Unfortunately, it is hard to implement while keeping compatibility.