hobbit-project / platform

HOBBIT benchmarking platform
GNU General Public License v2.0

A crashing system may not stop an experiment in development mode #560

Open MichaelRoeder opened 9 months ago

MichaelRoeder commented 9 months ago

Description

If the platform is in development mode and the system crashes before the benchmark is ready, the benchmark container is not terminated; instead, the experiment runs until it times out. An example of the platform controller log is given below:

2023-12-15 10:29:04,198 INFO [org.hobbit.controller.ExperimentManager] - <Creating next experiment 1702635957330 with benchmark http://w3id.org/dice-research/bbdc/ontology#Benchmark and system http://w3id.org/dice-research/bbdc/ontology#FaultyExampleSystem to the queue.>
2023-12-15 10:29:04,199 INFO [org.hobbit.controller.ExperimentManager] - <Starting new RabbitMQ for the experiment...>
2023-12-15 10:29:04,199 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Skipping image pulling because DOCKER_AUTOPULL is unset>
2023-12-15 10:29:04,209 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Will not remove container benchmark-317a2c6705c74d07bd9480e1426c87d9. Development mode is enabled.>
2023-12-15 10:29:04,229 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <The swarm cluster got only 1 node, I will not use placement constraints.>
2023-12-15 10:29:06,802 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Container rabbitm-1998fac84b2242d2940ae4422d0d889d created>
2023-12-15 10:29:06,802 INFO [org.hobbit.controller.ExperimentManager] - <Using the newly started RabbitMQ for the experiment: rabbitm-1998fac84b2242d2940ae4422d0d889d>
2023-12-15 10:29:06,802 INFO [org.hobbit.core.components.AbstractCommandReceivingComponent] - <This component will handle received commands in a single thread.>
2023-12-15 10:29:06,802 INFO [org.hobbit.controller.PlatformController] - <Setting experiment's RabbitMQ connector for the command queue: {rabbitMQHostName=rabbitm-1998fac84b2242d2940ae4422d0d889d}>
2023-12-15 10:29:06,803 WARN [org.hobbit.core.components.AbstractComponent] - <Couldn't connect to RabbitMQ with try #0. Next try in 5000ms.>
2023-12-15 10:29:11,813 INFO [org.hobbit.core.components.AbstractCommandReceivingComponent] - <Couldn't get the id of this Docker container. Won't be able to create containers.>
2023-12-15 10:29:11,817 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Skipping image pulling because DOCKER_AUTOPULL is unset>
2023-12-15 10:29:11,818 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Skipping image pulling because DOCKER_AUTOPULL is unset>
2023-12-15 10:29:11,818 ERROR [org.hobbit.controller.ExperimentManager] - <Could not load timeouts config (config/config.yaml (No such file or directory)). Using default value 1200000ms.>
2023-12-15 10:29:11,818 INFO [org.hobbit.controller.data.ExperimentStatus] - <Creating abort timer for http://w3id.org/hobbit/experiments#1702635957330 with 1200000ms.>
2023-12-15 10:29:11,818 INFO [org.hobbit.controller.ExperimentManager] - <Creating benchmark controller registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/benchmark>
2023-12-15 10:29:11,818 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Skipping image pulling because DOCKER_AUTOPULL is unset>
2023-12-15 10:29:11,857 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <The swarm cluster got only 1 node, I will not use placement constraints.>
2023-12-15 10:29:13,446 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Container benchmark-a2d2a317f1ce41338bacca7acab44a52 created>
2023-12-15 10:29:13,446 INFO [org.hobbit.controller.ExperimentManager] - <Creating system registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty>
2023-12-15 10:29:13,446 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Skipping image pulling because DOCKER_AUTOPULL is unset>
2023-12-15 10:29:13,515 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <The swarm cluster got only 1 node, I will not use placement constraints.>
2023-12-15 10:29:14,651 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Container system-faulty-2531b0e415b5437daab72ca3154d2382 created>
2023-12-15 10:29:14,651 INFO [org.hobbit.controller.ExperimentManager] - <Finished starting of new experiment.>
2023-12-15 10:29:19,047 INFO [org.hobbit.controller.PlatformController] - <Container system-faulty-2531b0e415b5437daab72ca3154d2382 stopped with exitCode=0>
2023-12-15 10:29:19,047 INFO [org.hobbit.controller.ExperimentManager] - <The system has been stopped before the benchmark has been started. Aborting.>
2023-12-15 10:29:19,067 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Will not remove container rabbitm-1998fac84b2242d2940ae4422d0d889d. Development mode is enabled.>
2023-12-15 10:29:19,091 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Will not remove container system-faulty-2531b0e415b5437daab72ca3154d2382. Development mode is enabled.>
2023-12-15 10:29:19,110 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Will not remove container benchmark-a2d2a317f1ce41338bacca7acab44a52. Development mode is enabled.>
2023-12-15 10:29:19,132 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Will not remove container system-faulty-2531b0e415b5437daab72ca3154d2382. Development mode is enabled.>
2023-12-15 10:29:34,043 INFO [org.hobbit.controller.PlatformController] - <received command: session=1702635957330, command=BENCHMARK_READY_SIGNAL>

After the last message, the platform does nothing until the experiment timer runs out.

Reason

Stopping the experiment is triggered internally by stopping all containers that belong to the experiment, including the benchmark container. The ExperimentManager implementation assumes that it will be notified once the other containers have stopped, and it cleans up the experiment based on these notifications. However, the Docker Swarm-based ContainerManagerImpl class does not distinguish between stopping a container and removing it. Since development mode forbids removal, the call to stop the containers does nothing, and the ExperimentManager never receives the notification that a container has stopped. Hence, it does not stop the experiment before it times out.
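
To illustrate the distinction described above, here is a minimal, self-contained sketch of a container manager that keeps "stop" and "remove" as separate steps. It is not the platform's code: `StopVsRemoveSketch`, `SketchContainerManager`, and `TerminationListener` are hypothetical names, and the in-memory lists merely stand in for Docker. The point is only that the stop notification is sent even when development mode keeps the container around.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types exist in the HOBBIT code base. */
final class StopVsRemoveSketch {

    /** Callback the experiment logic relies on to learn that a container ended. */
    interface TerminationListener {
        void containerStopped(String containerName);
    }

    /** In-memory stand-in for a container manager that separates stop from remove. */
    static final class SketchContainerManager {
        private final boolean developmentMode;
        private final TerminationListener listener;
        private final List<String> running = new ArrayList<>();
        private final List<String> existing = new ArrayList<>();

        SketchContainerManager(boolean developmentMode, TerminationListener listener) {
            this.developmentMode = developmentMode;
            this.listener = listener;
        }

        void start(String name) {
            running.add(name);
            existing.add(name);
        }

        /** Stops the container and notifies the listener, regardless of the mode. */
        void stop(String name) {
            if (running.remove(name)) {
                listener.containerStopped(name);
            }
        }

        /** Stops the container first; skips only the removal in development mode. */
        void remove(String name) {
            stop(name);
            if (developmentMode) {
                System.out.println("Will not remove container " + name
                        + ". Development mode is enabled. Please remove it manually.");
                return;
            }
            existing.remove(name);
        }

        boolean isRunning(String name) {
            return running.contains(name);
        }

        boolean exists(String name) {
            return existing.contains(name);
        }
    }
}
```

With this split, calling `remove` in development mode still triggers `containerStopped`, so an experiment manager built on the listener could clean up right away while the stopped container stays available for inspection.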

Reproducibility

  1. Run an experiment with a system that terminates immediately after it is started (the exit code doesn't seem to matter).
  2. Observe that the experiment keeps running although it should have been stopped.

Expected behavior

The platform should terminate the benchmark, or the log messages should tell the admin to remove the containers manually. The latter solution assumes that development mode is actually used only for development.

MichaelRoeder commented 9 months ago

Update

The system in the example doesn't really crash. Instead, it looks like the system's image is unknown (because of a typo), so no container starts at all. While the Docker service seems to be aware of the problem (see the inspect dump below), the platform doesn't seem to recognize that the image does not exist (note: DOCKER_AUTOPULL has been turned off).

micha@mprec:~/workspace/platform$ docker inspect wbjke9en2wfp
[
    {
        "ID": "wbjke9en2wfpj3k1hlrizajsk",
        "Version": {
            "Index": 4963
        },
        "CreatedAt": "2023-12-15T10:25:33.581954821Z",
        "UpdatedAt": "2023-12-15T10:25:34.531438292Z",
        "Labels": {},
        "Spec": {
            "ContainerSpec": {
                "Image": "registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty",
                "Labels": {
                    "org.hobbit.parent": "rabbitm-12af1ecf439e4c7a8ac21316ef5e3425",
                    "org.hobbit.type": "system"
                },
                "Hostname": "system-faulty-3f4859075e3d415ab65c4f86f2e51893",
                "Env": [
                    "HOBBIT_CONTAINER_NAME=system-faulty-3f4859075e3d415ab65c4f86f2e51893",
                    "HOBBIT_HARDWARE_NODES=1",
                    "HOBBIT_HARDWARE_NODES_SYSTEM=0",
                    "HOBBIT_HARDWARE_NODES_BENCHMARK=0",
                    "HOBBIT_RABBIT_HOST=rabbitm-12af1ecf439e4c7a8ac21316ef5e3425",
                    "HOBBIT_SESSION_ID=1702635354281",
                    "SYSTEM_PARAMETERS_MODEL={\n  \"@graph\" : [ {\n    \"@id\" : \"http://w3id.org/dice-research/bbdc/ontology#FaultyExampleSystem\",\n    \"@type\" : \"hobbit:SystemInstance\",\n    \"imageName\" : \"registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty\",\n    \"implementsAPI\" : \"http://w3id.org/dice-research/bbdc/ontology#API\",\n    \"comment\" : {\n      \"@language\" : \"en\",\n      \"@value\" : \"An example system that will crash.\"\n    },\n    \"label\" : {\n      \"@language\" : \"en\",\n      \"@value\" : \"Faulty System\"\n    }\n  }, {\n    \"@id\" : \"http://w3id.org/dice-research/bbdc/ontology#ModelLoadingError\",\n    \"comment\" : \"An error occurred while trying to load the model.\",\n    \"label\" : \"Model loading error\",\n    \"subClassOf\" : \"alg:Error\"\n  } ],\n  \"@context\" : {\n    \"label\" : {\n      \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#label\"\n    },\n    \"comment\" : {\n      \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#comment\"\n    },\n    \"implementsAPI\" : {\n      \"@id\" : \"http://w3id.org/hobbit/vocab#implementsAPI\",\n      \"@type\" : \"@id\"\n    },\n    \"imageName\" : {\n      \"@id\" : \"http://w3id.org/hobbit/vocab#imageName\"\n    },\n    \"subClassOf\" : {\n      \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#subClassOf\",\n      \"@type\" : \"@id\"\n    },\n    \"hobbit\" : \"http://w3id.org/hobbit/vocab#\",\n    \"@vocab\" : \"http://w3id.org/dice-research/bbdc/ontology#\",\n    \"rdf\" : \"http://www.w3.org/1999/02/22-rdf-syntax-ns#\",\n    \"xsd\" : \"http://www.w3.org/2001/XMLSchema#\",\n    \"rdfs\" : \"http://www.w3.org/2000/01/rdf-schema#\",\n    \"alg\" : \"http://www.w3id.org/dice-research/ontologies/algorithm/2023/06/\"\n  }\n}\n"
                ],
                "Isolation": "default"
            },
            "RestartPolicy": {
                "Condition": "none",
                "MaxAttempts": 0
            },
            "ForceUpdate": 0
        },
        "ServiceID": "rxg1u5xukgjxv05lzj9rlojiw",
        "Slot": 1,
        "NodeID": "4rjwfd4z8as4fja1cm4mce6re",
        "Status": {
            "Timestamp": "2023-12-15T10:25:34.315979134Z",
            "State": "rejected",
            "Message": "preparing",
            "Err": "No such image: registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty:latest",
            "ContainerStatus": {
                "ContainerID": "",
                "PID": 0,
                "ExitCode": 0
            },
            "PortStatus": {}
        },
        "DesiredState": "shutdown",
        "NetworksAttachments": [
            {
                "Network": {
                    "ID": "ufu6tp01zynmeceemu1ymz2va",
                    "Version": {
                        "Index": 4654
                    },
                    "CreatedAt": "2023-09-28T14:51:42.189735809Z",
                    "UpdatedAt": "2023-12-08T10:32:38.121230531Z",
                    "Spec": {
                        "Name": "hobbit",
                        "Labels": {},
                        "DriverConfiguration": {
                            "Name": "overlay"
                        },
                        "Attachable": true,
                        "IPAMOptions": {
                            "Driver": {
                                "Name": "default"
                            },
                            "Configs": [
                                {
                                    "Subnet": "172.16.100.0/24",
                                    "Gateway": "172.16.100.1"
                                }
                            ]
                        },
                        "Scope": "swarm"
                    },
                    "DriverState": {
                        "Name": "overlay",
                        "Options": {
                            "com.docker.network.driver.overlay.vxlanid_list": "4098"
                        }
                    },
                    "IPAMOptions": {
                        "Driver": {
                            "Name": "default"
                        },
                        "Configs": [
                            {
                                "Subnet": "172.16.100.0/24",
                                "Gateway": "172.16.100.1"
                            }
                        ]
                    }
                },
                "Addresses": [
                    "172.16.100.109/24"
                ]
            }
        ],
        "Volumes": null
    }
]

denkv commented 9 months ago

Is this inspect for a service?

We can try to detect this situation at the point where we already check for some failures: https://github.com/hobbit-project/platform/blob/e8c1d84fd70e2d80e879bf81c1b4c2a7b1e949f6/platform-controller/src/main/java/org/hobbit/controller/docker/ContainerManagerImpl.java#L511-L517
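
A rough sketch of what such a check could look like, assuming the task status is available as a state string plus an error message, as in the "Status" object of the inspect dump above. `RejectedTaskSketch`, `TaskStatusView`, and `findStartupFailure` are hypothetical names for illustration and are not taken from ContainerManagerImpl or the Docker client library; "rejected" and "failed" are the Swarm task states treated as startup failures here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; not part of the HOBBIT platform or any Docker client. */
final class RejectedTaskSketch {

    /** Simplified view of the "Status" object of a Swarm task. */
    static final class TaskStatusView {
        final String state;
        final String err;

        TaskStatusView(String state, String err) {
            this.state = state;
            this.err = err;
        }
    }

    /** Task states that mean no container will ever run for this task. */
    private static final List<String> FAILED_STATES = Arrays.asList("rejected", "failed");

    /** Returns a failure description if the task could not be started, else empty. */
    static Optional<String> findStartupFailure(TaskStatusView status) {
        if (status == null || status.state == null) {
            return Optional.empty();
        }
        if (FAILED_STATES.contains(status.state)) {
            String reason = (status.err == null || status.err.isEmpty())
                    ? "unknown reason"
                    : status.err;
            return Optional.of("Task " + status.state + ": " + reason);
        }
        return Optional.empty();
    }
}
```

For the task in the dump above (state "rejected", error "No such image: ..."), `findStartupFailure` would return a non-empty description that the controller could log and use to abort the experiment instead of waiting for the timeout.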