mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

Error when upgrading from Marathon 1.4 snap 27 to Marathon 1.4 RC3 #4898

Closed by armandgrillet 7 years ago

armandgrillet commented 7 years ago

I upgraded a DC/OS cluster that was previously running Marathon 1.4 snap 27 to Marathon 1.4 RC3. I am now unable to see the services in the DC/OS UI, and I get this message when I open the Marathon endpoint's web page:

Error fetching apps. Refresh to try again.

When checking the Marathon logs on the master with journalctl -u dcos-marathon, I see this error:

Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: [2016-12-21 13:22:47,162] INFO  InstanceTrackerActor is starting. Task loading initiated. (mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor:marathon-akka.actor.default-dispatcher-15)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: [2016-12-21 13:22:47,166] INFO  About to load 30 tasks (mesosphere.marathon.core.task.tracker.impl.InstancesLoaderImpl:ForkJoinPool-2-worker-55)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: [2016-12-21 13:22:47,172] ERROR while loading tasks (akka.actor.OneForOneStrategy:marathon-akka.actor.default-dispatcher-5)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: java.lang.IllegalStateException: while loading tasks
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor$$anonfun$initializing$1.applyOrElse(InstanceTrackerActor.scala:111)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor.aroundReceive(InstanceTrackerActor.scala:70)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.actor.ActorCell.invoke(ActorCell.scala:495)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.dispatch.Mailbox.run(Mailbox.scala:224)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: Caused by: play.api.libs.json.JsResultException: JsResultException(errors:List((/tasksMap/memory-pressure-task.66bc21eb-c3d3-11e6-bf7b-70b3d5800002/reservation,List(ValidationError(List(error.path.missing),WrappedArray()))), (/tasksMap/memory-pressure-task.66bc21eb-c3d3-11e6-bf7b-70b3d5800002/status/networkInfo,List(ValidationError(List(error.path.missing),WrappedArray()))))
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsReadable$$anonfun$2.apply(JsReadable.scala:23)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsReadable$$anonfun$2.apply(JsReadable.scala:23)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsResult$class.fold(JsResult.scala:73)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsError.fold(JsResult.scala:13)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsReadable$class.as(JsReadable.scala:21)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at play.api.libs.json.JsObject.as(JsValue.scala:76)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.storage.store.ZkStoreSerialization$$anonfun$8.apply(ZkStoreSerialization.scala:82)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.storage.store.ZkStoreSerialization$$anonfun$8.apply(ZkStoreSerialization.scala:80)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.http.scaladsl.unmarshalling.Unmarshaller$$anonfun$strict$1$$anonfun$apply$13.apply(Unmarshaller.scala:62)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.http.scaladsl.unmarshalling.Unmarshaller$$anonfun$strict$1$$anonfun$apply$13.apply(Unmarshaller.scala:62)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.http.scaladsl.unmarshalling.Unmarshaller$$anon$1.apply(Unmarshaller.scala:55)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at akka.http.scaladsl.unmarshalling.Unmarshal.to(Unmarshal.scala:19)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.core.storage.store.impl.BasePersistenceStore$stateMachine$macro$142$1.apply(BasePersistenceStore.scala:89)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at mesosphere.marathon.core.storage.store.impl.BasePersistenceStore$stateMachine$macro$142$1.apply(BasePersistenceStore.scala:85)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
Dec 21 13:22:47 ip-10-10-0-154 marathon.sh[18946]: at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
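The root cause in the trace is the JsResultException: the RC3 deserializers require fields (/tasksMap/…/reservation and /tasksMap/…/status/networkInfo) that instances persisted by snap 27 never wrote, so every stored task fails validation with error.path.missing and the InstanceTracker aborts loading. A simplified, self-contained stand-in for that Play JSON behavior (the names MissingPathDemo, validate, and the Map-based document shape are illustrative, not Marathon's actual code):

```scala
object MissingPathDemo {
  type Json = Map[String, Any]

  // Validate that every required path exists in the stored document,
  // accumulating Play-style "error.path.missing" errors instead of
  // stopping at the first one.
  def validate(doc: Json, requiredPaths: List[List[String]]): Either[List[String], Json] = {
    def exists(j: Any, path: List[String]): Boolean = (j, path) match {
      case (_, Nil) => true
      case (m: Map[_, _], key :: rest) =>
        m.asInstanceOf[Json].get(key).exists(exists(_, rest))
      case _ => false
    }
    val errors = requiredPaths.collect {
      case p if !exists(doc, p) => s"/${p.mkString("/")}: error.path.missing"
    }
    if (errors.isEmpty) Right(doc) else Left(errors)
  }

  def main(args: Array[String]): Unit = {
    // Shape of an instance written by the older snapshot: it has a taskId
    // and a status, but no "reservation" and no "status/networkInfo".
    val oldInstance: Json = Map(
      "taskId" -> "memory-pressure-task",
      "status" -> Map("stagedAt" -> "2016-12-16T21:05:41.050Z")
    )
    val required = List(List("reservation"), List("status", "networkInfo"))
    println(validate(oldInstance, required))
    // Left(List(/reservation: error.path.missing, /status/networkInfo: error.path.missing))
  }
}
```

Both missing paths are reported together, which matches the two ValidationErrors bundled into the single JsResultException in the log above.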

In ZooKeeper we can see that the root nodes are there but not the children:

[zk: localhost:2181(CONNECTED) 20] get /marathon/state/instance/f/memory-pressure-task.marathon-66bc21eb-c3d3-11e6-bf7b-70b3d5800002
{"instanceId":{"idString":"memory-pressure-task.marathon-66bc21eb-c3d3-11e6-bf7b-70b3d5800002"},"agentInfo":{"host":"10.10.0.180","agentId":"cacbb9b5-e870-4a84-a495-bc8036bab4dc-S33","attributes":[]},"state":{"condition":{"str":"running"},"since":"2016-12-16T21:05:41.605Z","activeSince":"2016-12-16T21:05:41.605Z"},"tasksMap":{"memory-pressure-task.66bc21eb-c3d3-11e6-bf7b-70b3d5800002":{"taskId":"memory-pressure-task.66bc21eb-c3d3-11e6-bf7b-70b3d5800002","agentInfo":{"host":"10.10.0.180","agentId":"cacbb9b5-e870-4a84-a495-bc8036bab4dc-S33","attributes":[]},"runSpecVersion":"2016-12-13T15:06:11.146Z","status":{"stagedAt":"2016-12-16T21:05:41.050Z","startedAt":"2016-12-16T21:05:41.605Z","mesosStatus":"CjsKOW1lbW9yeS1wcmVzc3VyZS10YXNrLjY2YmMyMWViLWMzZDMtMTFlNi1iZjdiLTcwYjNkNTgwMDAwMhABKioKKGNhY2JiOWI1LWU4NzAtNGE4NC1hNDk1LWJjODAzNmJhYjRkYy1TMzMxAqFmyRUV1kE6Owo5bWVtb3J5LXByZXNzdXJlLXRhc2suNjZiYzIxZWItYzNkMy0xMWU2LWJmN2ItNzBiM2Q1ODAwMDAySAJaELOnIrECDU+YjUzN4Of756VqPQoPKg0SCzEwLjEwLjAuMTgwGNv8ASImCiRlMzE4YjgyZS1hOTFiLTRlOTEtYTU3OC1hMTllZjBiMWM2Yjg=","condition":{"str":"running"}},"hostPorts":[29646]}},"runSpecVersion":"2016-12-13T15:06:11.146Z"}
cZxid = 0x10005174f
ctime = Fri Dec 16 21:05:41 GMT 2016
mZxid = 0x100051750
mtime = Fri Dec 16 21:05:41 GMT 2016
pZxid = 0x10005174f
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 1155
numChildren = 0
[zk: localhost:2181(CONNECTED) 21]
jeschkies commented 7 years ago

@meichstedt this looks like the network-info migration issue you mentioned. Is this it?

meichstedt commented 7 years ago

Yes, it is. Unfortunately, we currently only support migrations from release to release. This excludes RCs, snapshots, and individual commits. Far from optimal :|
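For clusters stuck on an intermediate format, the general mitigation idea (a sketch, not Marathon's actual migration code) is a tolerant reader that substitutes a default when a field is absent from a pre-release snapshot, rather than failing hard with error.path.missing. The names TolerantReadDemo and readOrDefault below are hypothetical:

```scala
object TolerantReadDemo {
  type Json = Map[String, Any]

  // Read a value at the given path, falling back to a default when the
  // stored document predates the field. This is the lenient counterpart
  // of a required-field read that throws on a missing path.
  def readOrDefault[A](doc: Json, path: List[String], default: A): A = {
    def walk(j: Any, p: List[String]): Option[Any] = (j, p) match {
      case (v, Nil) => Some(v)
      case (m: Map[_, _], key :: rest) =>
        m.asInstanceOf[Json].get(key).flatMap(walk(_, rest))
      case _ => None
    }
    walk(doc, path).map(_.asInstanceOf[A]).getOrElse(default)
  }

  def main(args: Array[String]): Unit = {
    // Instance persisted before "networkInfo" existed:
    val oldInstance: Json = Map("status" -> Map("stagedAt" -> "2016-12-16T21:05:41.050Z"))
    // Missing networkInfo yields an empty default instead of an exception.
    println(readOrDefault(oldInstance, List("status", "networkInfo"), Map.empty[String, Any]))
    // Map()
  }
}
```

In practice this is what a release-to-release migration step does once per format change; the comment above explains why no such step exists between the snap 27 and RC3 pre-release formats.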