Updated by @Donnype with the comment from @noamblitz.

About this feature

Detailed description

Currently, most boefjes are run in the boefje container as Python code. This has a few disadvantages:

Sandboxing: They are not sandboxed, so they could interact with things they are not supposed to interact with.
Reproducibility: Since they run as Python code inside the container, users cannot reproduce the results without running the Python code themselves or rerunning the boefje.
Transparency: It is hard to see what a boefje actually does when it runs.

Considerations:

Performance should not be affected too much.
Transparency: run commands should be visible in the front-end to reach our goal.
Developer experience should not be negatively affected too much.
Migrating to this new container should be optional at first to ease pressure on QA.

We propose an iterative approach to package all Python boefjes into a single container.

[x] 1. We create a single container with a Dockerfile in which we specify which boefjes should be copied. This container will be started and stopped on each boefje run. https://github.com/minvws/nl-kat-coordination/issues/3698
[ ] 2. We need to create specialized images for boefjes that use Docker directly. https://github.com/minvws/nl-kat-coordination/issues/3855
[ ] 3. We create a ~HTTP API~ worker around this container so it does not have to be stopped every time. https://github.com/minvws/nl-kat-coordination/issues/3861
~[ ] 4. We create several runner files like kubernetes.py and docker.py so OPsers are able to choose how the boefjes will be run.~ UPDATE from @Donnype: we decided to postpone this as the kubernetes runner would only be needed for installs where we can talk to control planes and need to dynamically start newly created (containerized) boefjes.

Single container

Several decisions on the first step of the implementation:

In boefje.json we will specify the path to the main.py so the code around boefje resolving does not have to be added to the new container. This can be done later if needed.
The "old" way of running boefjes will still be possible to ease the pressure of migrating.
In the run command of the container, we will add an argument which points to either a boefje_id or the path to the boefje.json.
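
As a rough illustration of the last decision, the container entrypoint could accept one positional argument and treat it either as a path to a boefje.json or as a boefje_id. This is a minimal sketch, not the actual implementation: the `main` key, the `id` lookup, the `--plugins-dir` flag and its default path are all assumptions made for the example.

```python
import argparse
import json
from pathlib import Path


def resolve_boefje_json(target: str, plugins_dir: Path) -> Path:
    """Return the boefje.json for an argument that is either a path or a boefje_id."""
    candidate = Path(target)
    if candidate.name == "boefje.json" and candidate.is_file():
        return candidate
    # Otherwise treat the argument as a boefje_id and scan the plugin directories.
    # Assumption: every boefje.json carries an "id" field.
    for path in sorted(plugins_dir.glob("*/boefje.json")):
        if json.loads(path.read_text()).get("id") == target:
            return path
    raise LookupError(f"No boefje found for {target!r}")


def main_path(boefje_json: Path) -> Path:
    """Read the (hypothetical) "main" key that points at the boefje's main.py."""
    definition = json.loads(boefje_json.read_text())
    return boefje_json.parent / definition.get("main", "main.py")


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run one boefje inside the container")
    parser.add_argument("boefje", help="boefje_id or path to a boefje.json")
    # Hypothetical default location of the copied boefjes inside the image.
    parser.add_argument("--plugins-dir", type=Path, default=Path("/app/boefjes/plugins"))
    return parser.parse_args(argv)
```

Because the argument fully identifies the boefje, the same value can be shown in the front-end as part of the run command, which is what makes the run reproducible for an expert user.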

Feature benefit/User story

As an expert user, I want to be able to reproduce raw files of the current Python boefjes. To do this, KAT should communicate the run command of the container.


Containerizing boefjes

The reason to run all boefjes in a container is to run each boefje in a sandbox. In the future it will also be possible to run boefjes created by others, not only boefjes created by KAT. Running those in a sandbox decreases the risk of doing so.

Starting a container for every boefje task results in a lot of overhead, so we want to support running multiple tasks in a single container.

For the boefje containers we need to support two ways of deploying KAT:

KAT has access to the container system control plane to start/stop containers. In this case KAT can automatically start new containers when necessary, but there needs to be a runner that can talk to the control plane and start them.
KAT has no access to the control plane and the system administrator configures all the necessary containers themselves beforehand.

This means we need to support long-running boefje containers. Such a container can either pull tasks from the runner, or the runner can push tasks to the boefje container if the container exposes a service.

Pull-based design

Boefje is started as a container
Boefje makes an HTTP GET request to the boefjes runner to fetch input
Boefje runs
Boefje submits output using an HTTP POST request to the boefjes runner
Boefje makes a second HTTP GET request to the boefjes runner to fetch input
Boefje runs
Boefje submits output using an HTTP POST request to the boefjes runner

When no task is available, the boefje can either keep its request open on the boefjes runner until a new task arrives (long-polling) or send a new request after some timeout.
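
The pull-based steps above can be sketched as a small worker loop. This is an illustration under stated assumptions, not the boefjes runner API: the `/tasks/next` and `/tasks/{id}/output` endpoints, the 204 "no task" response, and the `id` field of a task are hypothetical. The loop uses the simpler "retry after a timeout" variant instead of long-polling.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_task(runner_url: str, timeout: float = 30.0) -> Optional[dict]:
    """GET the next task; None when the runner has nothing queued (assumed: 204)."""
    with urllib.request.urlopen(f"{runner_url}/tasks/next", timeout=timeout) as response:
        if response.status == 204:
            return None
        return json.loads(response.read())


def submit_output(runner_url: str, task_id: str, output: dict) -> None:
    """POST the boefje output back to the runner (hypothetical endpoint)."""
    request = urllib.request.Request(
        f"{runner_url}/tasks/{task_id}/output",
        data=json.dumps(output).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30.0):
        pass


def work(
    fetch: Callable[[], Optional[dict]],
    run: Callable[[dict], dict],
    submit: Callable[[str, dict], None],
    poll_interval: float = 5.0,
    max_tasks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch, run and submit tasks until max_tasks are done (forever if None)."""
    done = 0
    while max_tasks is None or done < max_tasks:
        task = fetch()
        if task is None:
            sleep(poll_interval)  # nothing queued: retry after a timeout
            continue
        submit(task["id"], run(task))
        done += 1
    return done
```

Passing `fetch`, `run` and `submit` as callables keeps the loop independent of the transport, so the same loop would work if the runner later offered long-polling: only `fetch` would change.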

Push-based design

A specific boefje container is started
The container starts listening on HTTP
The boefjes runner sends an HTTP request to the container to start a job
Boefje runs
Boefje submits output using an HTTP POST request to the boefjes runner
The boefjes runner sends a second HTTP request to the container to start a job
Boefje runs
Boefje submits output using an HTTP POST request to the boefjes runner

The pull-based design is how task queues are usually implemented: a process that executes tasks pulls them from the queue.

Pushing tasks becomes more complicated once you scale to multiple boefje containers that execute tasks. How will the boefjes runner know which container to push a task to? Some boefje tasks might take a very long time to execute, while others are short. If you want to use autoscaling and put a load balancer in front of the boefje HTTP service, the question is how the load balancing should work with those very long-running tasks. HTTP load balancers usually balance a high number of short HTTP requests, not long-running tasks.
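
To make the extra bookkeeping concrete, here is a minimal sketch of the state a push-based runner would have to keep: which containers are idle, which are busy, and with what. The `PushDispatcher` class and the container URLs are hypothetical and only illustrate the problem; in the pull-based design this state is not needed, because an idle container simply asks for the next task.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PushDispatcher:
    """Tracks which boefje containers are idle so tasks are only pushed to those."""

    idle: set[str] = field(default_factory=set)
    busy: dict[str, str] = field(default_factory=dict)  # container URL -> task id

    def register(self, container_url: str) -> None:
        """A newly started container announces that it is ready for work."""
        self.idle.add(container_url)

    def assign(self, task_id: str) -> Optional[str]:
        """Pick an idle container for the task, or None if all are busy."""
        if not self.idle:
            return None  # the runner has to queue the task and retry later
        container_url = min(self.idle)  # deterministic choice for the sketch
        self.idle.remove(container_url)
        self.busy[container_url] = task_id
        return container_url

    def complete(self, container_url: str) -> None:
        """Called when the container has POSTed its output back to the runner."""
        self.busy.pop(container_url, None)
        self.idle.add(container_url)
```

A generic HTTP load balancer has no view of this state, so a container that is busy with a long-running task could still receive new requests.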

minvws / nl-kat-coordination

[EPIC] Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) #3593
