Open MichaelRoeder opened 11 months ago
The system in the example doesn't really crash. Instead, it looks like the image of the system is unknown (because of a typo), so no container starts at all. While the Docker service seems to be aware of the problem (see the inspect dump below), the platform doesn't seem to recognize that the image does not exist (note: AUTO_PULL has been turned off).
micha@mprec:~/workspace/platform$ docker inspect wbjke9en2wfp
[
{
"ID": "wbjke9en2wfpj3k1hlrizajsk",
"Version": {
"Index": 4963
},
"CreatedAt": "2023-12-15T10:25:33.581954821Z",
"UpdatedAt": "2023-12-15T10:25:34.531438292Z",
"Labels": {},
"Spec": {
"ContainerSpec": {
"Image": "registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty",
"Labels": {
"org.hobbit.parent": "rabbitm-12af1ecf439e4c7a8ac21316ef5e3425",
"org.hobbit.type": "system"
},
"Hostname": "system-faulty-3f4859075e3d415ab65c4f86f2e51893",
"Env": [
"HOBBIT_CONTAINER_NAME=system-faulty-3f4859075e3d415ab65c4f86f2e51893",
"HOBBIT_HARDWARE_NODES=1",
"HOBBIT_HARDWARE_NODES_SYSTEM=0",
"HOBBIT_HARDWARE_NODES_BENCHMARK=0",
"HOBBIT_RABBIT_HOST=rabbitm-12af1ecf439e4c7a8ac21316ef5e3425",
"HOBBIT_SESSION_ID=1702635354281",
"SYSTEM_PARAMETERS_MODEL={\n \"@graph\" : [ {\n \"@id\" : \"http://w3id.org/dice-research/bbdc/ontology#FaultyExampleSystem\",\n \"@type\" : \"hobbit:SystemInstance\",\n \"imageName\" : \"registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty\",\n \"implementsAPI\" : \"http://w3id.org/dice-research/bbdc/ontology#API\",\n \"comment\" : {\n \"@language\" : \"en\",\n \"@value\" : \"An example system that will crash.\"\n },\n \"label\" : {\n \"@language\" : \"en\",\n \"@value\" : \"Faulty System\"\n }\n }, {\n \"@id\" : \"http://w3id.org/dice-research/bbdc/ontology#ModelLoadingError\",\n \"comment\" : \"An error occurred while trying to load the model.\",\n \"label\" : \"Model loading error\",\n \"subClassOf\" : \"alg:Error\"\n } ],\n \"@context\" : {\n \"label\" : {\n \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#label\"\n },\n \"comment\" : {\n \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#comment\"\n },\n \"implementsAPI\" : {\n \"@id\" : \"http://w3id.org/hobbit/vocab#implementsAPI\",\n \"@type\" : \"@id\"\n },\n \"imageName\" : {\n \"@id\" : \"http://w3id.org/hobbit/vocab#imageName\"\n },\n \"subClassOf\" : {\n \"@id\" : \"http://www.w3.org/2000/01/rdf-schema#subClassOf\",\n \"@type\" : \"@id\"\n },\n \"hobbit\" : \"http://w3id.org/hobbit/vocab#\",\n \"@vocab\" : \"http://w3id.org/dice-research/bbdc/ontology#\",\n \"rdf\" : \"http://www.w3.org/1999/02/22-rdf-syntax-ns#\",\n \"xsd\" : \"http://www.w3.org/2001/XMLSchema#\",\n \"rdfs\" : \"http://www.w3.org/2000/01/rdf-schema#\",\n \"alg\" : \"http://www.w3id.org/dice-research/ontologies/algorithm/2023/06/\"\n }\n}\n"
],
"Isolation": "default"
},
"RestartPolicy": {
"Condition": "none",
"MaxAttempts": 0
},
"ForceUpdate": 0
},
"ServiceID": "rxg1u5xukgjxv05lzj9rlojiw",
"Slot": 1,
"NodeID": "4rjwfd4z8as4fja1cm4mce6re",
"Status": {
"Timestamp": "2023-12-15T10:25:34.315979134Z",
"State": "rejected",
"Message": "preparing",
"Err": "No such image: registry.gitlab.csl.uni-bremen.de/bbdc/bbdc-2024/bbdc-hobbit-benchmark/system-faulty:latest",
"ContainerStatus": {
"ContainerID": "",
"PID": 0,
"ExitCode": 0
},
"PortStatus": {}
},
"DesiredState": "shutdown",
"NetworksAttachments": [
{
"Network": {
"ID": "ufu6tp01zynmeceemu1ymz2va",
"Version": {
"Index": 4654
},
"CreatedAt": "2023-09-28T14:51:42.189735809Z",
"UpdatedAt": "2023-12-08T10:32:38.121230531Z",
"Spec": {
"Name": "hobbit",
"Labels": {},
"DriverConfiguration": {
"Name": "overlay"
},
"Attachable": true,
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.16.100.0/24",
"Gateway": "172.16.100.1"
}
]
},
"Scope": "swarm"
},
"DriverState": {
"Name": "overlay",
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4098"
}
},
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.16.100.0/24",
"Gateway": "172.16.100.1"
}
]
}
},
"Addresses": [
"172.16.100.109/24"
]
}
],
"Volumes": null
}
]
Is this inspect for a service?
We can try to detect this situation at the point where we already check for some failures: https://github.com/hobbit-project/platform/blob/e8c1d84fd70e2d80e879bf81c1b4c2a7b1e949f6/platform-controller/src/main/java/org/hobbit/controller/docker/ContainerManagerImpl.java#L511-L517
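A minimal sketch of such a check, assuming the platform can read the task's Status fields shown in the inspect dump (State and Err). The classes TaskStatusView and TaskFailureCheck are hypothetical and are not part of ContainerManagerImpl; the choice of states to report is also an assumption.

```java
import java.util.Optional;

/** Hypothetical, minimal view of the "Status" object from the inspect dump. */
final class TaskStatusView {
    final String state; // e.g. "running" or "rejected"
    final String err;   // e.g. "No such image: ..."; may be null or empty

    TaskStatusView(String state, String err) {
        this.state = state;
        this.err = err;
    }
}

/** Hypothetical helper; not part of the platform's ContainerManagerImpl. */
final class TaskFailureCheck {
    private TaskFailureCheck() {}

    /**
     * Returns a failure description if the task state indicates that the
     * task was rejected or failed, or an empty Optional otherwise.
     */
    static Optional<String> startFailure(TaskStatusView status) {
        // Assumption: "rejected" and "failed" are the states worth reporting here.
        boolean failedToStart = "rejected".equals(status.state) || "failed".equals(status.state);
        if (!failedToStart) {
            return Optional.empty();
        }
        // Fall back to a placeholder when Docker did not provide an error text.
        String reason = (status.err == null || status.err.isEmpty()) ? "no error message" : status.err;
        return Optional.of("Task is " + status.state + ": " + reason);
    }
}
```

With the values from the dump (State "rejected", Err "No such image: ..."), the helper would return a non-empty description that the platform could log and use to abort the experiment instead of waiting for a container that never starts.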
Description
If the platform is in develop mode and the system crashes before the benchmark is ready, the benchmark container is not terminated; instead, the experiment times out. An example of the platform controller log is given below:
After the last message, the platform does nothing until the experiment timer runs out.
Reason
Stopping the experiment is triggered internally by stopping all containers that belong to the experiment, including the benchmark container. The implementation of the ExperimentManager assumes that it will be notified when the other containers have stopped, and it cleans up the experiment based on this information. However, the Docker swarm-based ContainerManagerImpl class does not distinguish between stopping a container and removing it. Since develop mode forbids removal, the call to stop the containers actually does nothing, and the ExperimentManager is never notified that a container has stopped. Hence, it does not stop the experiment before it times out.
Reproducibility
Expected behavior
The platform should terminate the benchmark, or the log messages should inform the admin that they need to remove the containers manually. The latter solution would assume that develop mode is actually only used for development.
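Both options can be combined in one place. The following is a rough sketch, not the platform's actual implementation: it assumes a container manager that treats stopping and removing as separate steps, always notifies its observers on stop, and only skips the removal in develop mode. All names (TerminationObserver, SketchContainerManager, and the list fields used for bookkeeping) are hypothetical, and the Docker calls are replaced by in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical observer; stands in for whatever the ExperimentManager listens on. */
interface TerminationObserver {
    void containerTerminated(String containerName);
}

/** Hypothetical manager that separates "stop" from "remove". */
final class SketchContainerManager {
    private final boolean developMode;
    private final List<TerminationObserver> observers = new ArrayList<>();

    // In-memory bookkeeping for illustration; real code would call Docker here.
    final List<String> stopped = new ArrayList<>();
    final List<String> removed = new ArrayList<>();
    final List<String> warnings = new ArrayList<>();

    SketchContainerManager(boolean developMode) {
        this.developMode = developMode;
    }

    void addObserver(TerminationObserver observer) {
        observers.add(observer);
    }

    void stopContainer(String name) {
        // Step 1: always stop the container and tell the observers,
        // so that the experiment can be cleaned up in every mode.
        stopped.add(name);
        for (TerminationObserver observer : observers) {
            observer.containerTerminated(name);
        }
        // Step 2: remove the container only outside develop mode.
        if (developMode) {
            warnings.add("Develop mode: container " + name
                    + " was stopped but not removed; remove it manually.");
        } else {
            removed.add(name);
        }
    }
}
```

The point of the split is that the notification no longer depends on the removal, so an ExperimentManager-like observer would hear about the stop even when develop mode keeps the container around for debugging.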