hobbit-project / platform

HOBBIT benchmarking platform
GNU General Public License v2.0

Possible reason for the platform hanging during experiment cancellation #365

Open smirnp opened 5 years ago

smirnp commented 5 years ago

Hi!

I got this error when the platform failed to cancel the experiment:

java.lang.NullPointerException: path is 'null'.
    at jersey.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:226)
    at org.glassfish.jersey.client.JerseyWebTarget.path(JerseyWebTarget.java:151)
    at org.glassfish.jersey.client.JerseyWebTarget.path(JerseyWebTarget.java:59)
    at com.spotify.docker.client.DefaultDockerClient.inspectTask(DefaultDockerClient.java:1980)
    at org.hobbit.controller.docker.ContainerManagerImpl.removeContainer(ContainerManagerImpl.java:606)
    at org.hobbit.controller.docker.ContainerManagerImpl.removeParentAndChildren(ContainerManagerImpl.java:660)
    at org.hobbit.controller.ExperimentManager.forceBenchmarkTerminate_unsecured(ExperimentManager.java:563)
    at org.hobbit.controller.ExperimentManager.stopExperimentIfRunning(ExperimentManager.java:816)
    at org.hobbit.controller.PlatformController.handleFrontEndCmd(PlatformController.java:828)
    at org.hobbit.controller.front.FrontEndApiHandler$MsgProcessingTask.run(FrontEndApiHandler.java:76)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
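
The trace shows `DefaultDockerClient.inspectTask` being reached with a null id from `ContainerManagerImpl.removeContainer`. A possible guard is sketched below, as a minimal example only. `TaskClient`, `InMemoryTaskClient` and `SafeContainerRemoval` are hypothetical names made up for illustration; they are not the platform's classes or the Spotify docker-client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the Docker calls involved; NOT the real client API. */
interface TaskClient {
    /** Returns the task id for a container name, or null if no such task exists. */
    String findTaskId(String containerName);

    void removeService(String taskId);
}

/** In-memory implementation, used only to make the sketch runnable. */
final class InMemoryTaskClient implements TaskClient {
    final Map<String, String> tasks = new HashMap<>();
    final List<String> removedTaskIds = new ArrayList<>();

    @Override
    public String findTaskId(String containerName) {
        return tasks.get(containerName); // null when the container is already gone
    }

    @Override
    public void removeService(String taskId) {
        removedTaskIds.add(taskId);
    }
}

final class SafeContainerRemoval {
    /**
     * Removes the service behind a container. Returns false instead of
     * throwing when no task id can be resolved, e.g. because the container
     * crashed before the experiment was cancelled.
     */
    static boolean removeContainer(TaskClient client, String containerName) {
        String taskId = client.findTaskId(containerName);
        if (taskId == null) {
            // Nothing to inspect or remove; skip instead of passing null on.
            return false;
        }
        client.removeService(taskId);
        return true;
    }
}
```

With a guard like this, a container that has already disappeared is skipped and the removal of the remaining parent and child containers can continue.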

smirnp commented 5 years ago

The platform also hangs for ~10 minutes (while the GUI says "Failed Unable to remove experiment 1541700297545") and produces the following logs:

2018-11-08 19:18:51,279 ERROR [org.hobbit.controller.ExperimentManager] - <The experiment http://w3id.org/hobbit/experiments#1541700297545 was stopped by the user. Forcing termination.>
2018-11-08 19:18:51,311 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Container for task qpduw750i1dmv8y3nvaxba4b1 has no exit code, assuming 0>
2018-11-08 19:18:51,311 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Removing service of container with task id qpduw750i1dmv8y3nvaxba4b1. >
2018-11-08 19:18:51,645 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Container for task js0fptg3fcmamw29hfbxsb4j0 has no exit code, assuming 0>
2018-11-08 19:18:51,646 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Removing service of container with task id js0fptg3fcmamw29hfbxsb4j0. >
2018-11-08 19:18:51,946 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Container for task m25536ri7s46d7scbj93m2cnz has no exit code, assuming 0>
2018-11-08 19:18:51,947 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Removing service of container with task id m25536ri7s46d7scbj93m2cnz. >
2018-11-08 19:18:52,237 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Container for task q9qqxld9lcvyy498g42p865pg has no exit code, assuming 0>
2018-11-08 19:18:52,238 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Removing service of container with task id q9qqxld9lcvyy498g42p865pg. >
2018-11-08 19:18:52,535 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Container for task v76rglb9xjcq9xk39k2x0weoy has no exit code, assuming 0>
2018-11-08 19:18:52,535 INFO [org.hobbit.controller.docker.ContainerManagerImpl] - <Removing service of container with task id v76rglb9xjcq9xk39k2x0weoy. >
2018-11-08 19:19:06,266 INFO [org.hobbit.controller.docker.ContainerStateObserverImpl] - <Couldn't get the status of container js0fptg3fcmamw29hfbxsb4j0. Assuming it was stopped by the platform.>
2018-11-08 19:19:06,266 INFO [org.hobbit.controller.PlatformController] - <Container js0fptg3fcmamw29hfbxsb4j0 stopped with exitCode=137>
2018-11-08 19:19:06,267 INFO [org.hobbit.controller.ExperimentManager] - <Sending broadcast message...>
2018-11-08 19:19:06,328 INFO [org.hobbit.controller.ExperimentManager] - <Unknown container js0fptg3fcmamw29hfbxsb4j0 stopped with exitCode=137>
2018-11-08 19:19:06,402 WARN [org.hobbit.controller.docker.ContainerManagerImpl] - <Couldn't remove container js0fptg3fcmamw29hfbxsb4j0 because it doesn't exist>

The containers mentioned above had crashed on their own before I cancelled the experiment in the GUI.

smirnp commented 5 years ago

It seems that experimentTimeout is not flushed after this, and the platform does not check the queue for new experiments for ~10 minutes.
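
If the hang is caused by the exception skipping the cleanup of the running experiment (an assumption, not confirmed against the platform code), one way to avoid it would be to release the experiment state in a `finally` block. The sketch below is a simplified model; `ExperimentSlot` and its methods are hypothetical and do not correspond to `ExperimentManager`.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified model of "one running experiment plus a queue". */
final class ExperimentSlot {
    private final Queue<String> queue = new ArrayDeque<>();
    private String runningId; // null means the platform is idle

    void enqueue(String experimentId) {
        queue.add(experimentId);
    }

    boolean isIdle() {
        return runningId == null;
    }

    /** Starts the next queued experiment if idle; returns its id, or null if nothing was started. */
    String startNextIfIdle() {
        if (runningId != null || queue.isEmpty()) {
            return null;
        }
        runningId = queue.poll();
        return runningId;
    }

    /**
     * Cancels the running experiment. Returns true only if termination
     * completed without an exception; the slot is released in either case.
     */
    boolean cancelRunning(Runnable terminateContainers) {
        if (runningId == null) {
            return false; // nothing to cancel
        }
        try {
            terminateContainers.run();
            return true;
        } catch (RuntimeException e) {
            // Report the failure, but do not let it block the queue.
            System.err.println("Termination failed for " + runningId + ": " + e);
            return false;
        } finally {
            runningId = null; // always free the slot for the next experiment
        }
    }
}
```

In this model a `NullPointerException` thrown during container removal still leaves the slot idle, so the next queued experiment can start right away instead of waiting for a timeout.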