mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0
4.07k stars 845 forks source link

Marathon still sending event callbacks to deleted subscribers #777

Closed benley closed 9 years ago

benley commented 9 years ago

This morning I noticed that I had some stale hosts registered in eventSubscriptions, so I DELETEd them:


stale_urls=(
  http://ec2-54-165-125-93.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ec2-54-165-159-191.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ec2-54-165-161-54.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ec2-54-172-158-186.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ec2-54-172-44-67.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ec2-54-86-189-7.compute-1.amazonaws.com:8000/api/marathon/event_callback
  http://ip-172-21-46-66.ec2.internal:8000/api/marathon/event_callback
)

for URL in "${stale_urls[@]}"; do
  http DELETE "http://my_marathon_frontend/v2/eventSubscriptions?callbackUrl=${URL}"
done

Output snippet:

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 197
Content-Type: application/json
Server: Jetty(8.y.z-SNAPSHOT)

{
    "callbackUrl": "http://ec2-54-165-161-54.compute-1.amazonaws.com:8000/api/marathon/event_callback",
    "clientIp": "172.21.71.160",
    "eventType": "unsubscribe_event",
    "timestamp": "2014-11-10T23:06:00.707Z"
}
# etc

After unsubscribing the stale URLs, they were gone from the list:

$ curl http://my_marathon_frontend/v2/eventSubscriptions
{
  "callbackUrls": [
    "http://ec2-54-173-43-83.compute-1.amazonaws.com:8000/api/marathon/event_callback",
    "http://ec2-54-173-62-16.compute-1.amazonaws.com:8000/api/marathon/event_callback",
    "http://ec2-54-88-19-137.compute-1.amazonaws.com:8000/api/marathon/event_callback"
  ]
}

But in syslog 15 and 30 minutes later, marathon was still printing errors about trying to connect to the dead hosts:

Nov 10 23:11:55 ec2-54-173-62-16 marathon[24838]: [WARN] [11/10/2014 23:11:55.334] [marathon-akka.actor.default-dispatcher-10] [akka://marathon/user/IO-HTTP/host-connector-11/1] Connection attempt to ip-172-21-46-66.ec2.internal:8000 failed in response to POST request to /api/marathon/event_callback with no retries left, dispatching error...
Nov 10 23:11:55 ec2-54-173-62-16 marathon[24838]: [WARN] [11/10/2014 23:11:55.334] [marathon-akka.actor.default-dispatcher-12] [akka://marathon/user/IO-HTTP/host-connector-11/0] Connection attempt to ip-172-21-46-66.ec2.internal:8000 failed in response to POST request to /api/marathon/event_callback with no retries left, dispatching error...
Nov 10 23:11:55 ec2-54-173-62-16 marathon[24838]: [WARN] [11/10/2014 23:11:55.334] [marathon-akka.actor.default-dispatcher-8] [akka://marathon/user/IO-HTTP/host-connector-11/2] Connection attempt to ip-172-21-46-66.ec2.internal:8000 failed in response to POST request to /api/marathon/event_callback with no retries left, dispatching error...
Nov 10 23:11:55 ec2-54-173-62-16 marathon[24838]: [WARN] [11/10/2014 23:11:55.335] [marathon-akka.actor.default-dispatcher-8] [akka://marathon/user/IO-HTTP/host-connector-11/3] Connection attempt to ip-172-21-46-66.ec2.internal:8000 failed in response to POST request to /api/marathon/event_callback with no retries left, dispatching error...

# <snip>

Nov 10 23:28:11 ec2-54-173-62-16 marathon[24838]: [WARN] [11/10/2014 23:28:11.455] [marathon-akka.actor.default-dispatcher-10] [akka://marathon/user/IO-HTTP/host-connector-12/3] Connection attempt to ec2-54-86-189-7.compute-1.amazonaws.com:8000 failed in response to POST request to /api/marathon/event_callback with 5 retries left, retrying...

# etc.

I'm running marathon 0.7.3 in HA mode. There is a haproxy instance in front of marathon's http interface, so sequential http requests may not be sent to the same marathon host - though the non-master instances appear to successfully proxy those requests to the master.

Marathon commandline:

java -Xmx512m \
  -Djava.library.path=/usr/local/lib \
  -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n \
  -cp /usr/local/sbin/marathon mesosphere.marathon.Main \
  --event_subscriber http_callback \
  --http_port 8080 \
  --checkpoint \
  --task_launch_timeout 180000 \
  --zk zk://<zookeeper_ips:2181>/marathon \
  --master zk://<zookeeper_ips:2181>/mesos

After restarting the master marathon instance, one of the standby instances became master and the stale notifications appear to have stopped.

ConnorDoyle commented 9 years ago

Thanks for reporting, will try to repro this.

everpeace commented 9 years ago

The reason might be spray's retries. I couldn't find where marathon configures spray's HostConnector, though. Even after user deletes some urls from subscribers list, post future could retry to deleted url.

ConnorDoyle commented 9 years ago

Tried to reproduce this at some length, but we were unable to make it happen on our test cluster. Are you still seeing this problem?

drexin commented 9 years ago

@benley

This may be related to a leadership issue and has probably been fixed with this patch: https://github.com/mesosphere/marathon/pull/953. Do you still see this with current master?

benley commented 9 years ago

Haven't had a chance to test yet - will get back to you asap.

drexin commented 9 years ago

@benley Any news on this?

ConnorDoyle commented 9 years ago

Closing this for now, please comment if this comes up again.

benley commented 9 years ago

Oops, forgot to follow up with this. I am not seeing this issue anymore with marathon 0.7.6, so closing this is good.

ConnorDoyle commented 9 years ago

Thanks!