mesos / kafka

Apache Kafka on Apache Mesos
Apache License 2.0
414 stars 140 forks source link

Error: broker start timeout #155

Closed aglahe closed 9 years ago

aglahe commented 9 years ago

CoreOS 845, Docker 1.8.3, Mesos 0.25 and deploying in Marathon .11

So, i create the broker and can see that it was created via API:

$ curl http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/list
{"brokers" : [{"stickiness" : {"period" : "10m"}, "id" : "0", "port" : "31092", "mem" : 2048, "cpus" : 1.0, "heap" : 1024, "failover" : {"delay" : "1m", "maxDelay" : "10m"}, "active" : false}]}

and when i do a:

$ ./kafka-mesos.sh broker start 0
Error: broker start timeout

and I can see it "trying to start"..

$ ./kafka-mesos.sh broker list
broker:
  id: 0
  active: true
  state: starting
  resources: cpus:1.00, mem:2048, heap:1024, port:31092
  failover: delay:1m, max-delay:10m
  stickiness: period:10m

I must be missing something fundamental: the marathon JSON is:

{
  "id": "kafka-mesos-scheduler",
  "cpus": 1.0,
  "mem": 1024.0,
  "instances": 1,
  "ports": [31700],
  "cmd": "./kafka-mesos.sh scheduler --master=zk://zk01:2181,zk02:2181,zk03:2181,zk04:2181,zk05:2181/mesos --zk=zk01:2181,zk02:2181,zk03:2181,zk04:2181,zk05:2181 --api=http://kafka-mesos-scheduler.marathon.mesos:31700 --storage=zk:/kafka-mesos",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "private-hub/kafka-mesos:0.9.3.0",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 31700, "hostPort": 31700, "servicePort": 31700, "protocol": "tcp" }
      ],
      "parameters": [
        { "key": "name", "value": "kafka-mesos-scheduler" }
      ]
    }
  },
  "maxLaunchDelaySeconds": 7200
}
dmitrypekar commented 9 years ago

@aglahe Please look into stdout and stderr files in the sandbox from the mesos UI. It should probably contain the error, happening during broker start. If this would not help to identify the issue, please attach those files to the ticket.

aglahe commented 9 years ago

So, i started the scheduler with --debut=true

stderr:

I1117 16:12:18.648713  2681 logging.cpp:172] INFO level logging started!
I1117 16:12:18.648988  2681 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"file:\/\/\/etc\/docker.tar.gz"}}],"sandbox_directory":"\/tmp\/slaves\/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23\/frameworks\/20150925-222237-83912970-5050-1-0001\/executors\/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733\/runs\/fba1d308-1c6f-479e-ba17-94133afb71fe"}
I1117 16:12:18.651567  2681 fetcher.cpp:369] Fetching URI 'file:///etc/docker.tar.gz'
I1117 16:12:18.651590  2681 fetcher.cpp:243] Fetching directly into the sandbox directory
I1117 16:12:18.651618  2681 fetcher.cpp:180] Fetching URI 'file:///etc/docker.tar.gz'
I1117 16:12:18.651638  2681 fetcher.cpp:160] Copying resource with command:cp '/etc/docker.tar.gz' '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe/docker.tar.gz'
I1117 16:12:18.655037  2681 fetcher.cpp:76] Extracting with command: tar -C '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe' -xf '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe/docker.tar.gz'
I1117 16:12:18.661016  2681 fetcher.cpp:84] Extracted '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe/docker.tar.gz' into '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe'
I1117 16:12:18.661087  2681 fetcher.cpp:446] Fetched 'file:///etc/docker.tar.gz' to '/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe/docker.tar.gz'
I1117 16:12:18.876677  2693 exec.cpp:134] Version: 0.25.0
I1117 16:12:18.881270  2728 exec.cpp:208] Executor registered on slave 23febc46-85d6-471b-bdce-a46da8ccdbbc-S23
tput: No value for $TERM and no -T specified
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@716: Client environment:host.name=6be26e12cad9
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@724: Client environment:os.arch=4.2.2-coreos
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Wed Oct 14 22:58:02 UTC 2015
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@log_env@753: Client environment:user.dir=/opt/kafka-mesos
2015-11-17 16:12:22,768:12(0x7fa19affd700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.105.0.1:2181,10.105.0.3:2181,10.105.0.5:2181,10.105.0.7:2181,10.105.0.9:2181 sessionTimeout=10000 watcher=0x7fa1cb3ac600 sessionId=0 sessionPasswd=<null> context=0x7f9fa4000ad0 flags=0
2015-11-17 16:12:22,769:12(0x7fa191ffb700):ZOO_INFO@check_events@1703: initiated connection to server [10.105.0.5:2181]
I1117 16:12:22.771742    13 sched.cpp:164] Version: 0.25.0
2015-11-17 16:12:22,819:12(0x7fa191ffb700):ZOO_INFO@check_events@1750: session establishment complete on server [10.105.0.5:2181], sessionId=0x35071d22f9d0023, negotiated timeout=10000
I1117 16:12:22.819651   132 group.cpp:331] Group process (group(1)@172.17.0.40:42763) connected to ZooKeeper
I1117 16:12:22.819699   132 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1117 16:12:22.819712   132 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1117 16:12:22.821861   152 detector.cpp:156] Detected a new leader: (id='72')
I1117 16:12:22.822012   135 group.cpp:674] Trying to get '/mesos/json.info_0000000072' in ZooKeeper
I1117 16:12:22.823520   153 detector.cpp:481] A new leading master (UPID=master@10.105.0.5:5050) is detected
I1117 16:12:22.823668   131 sched.cpp:262] New master detected at master@10.105.0.5:5050
I1117 16:12:22.823957   131 sched.cpp:272] No credentials provided. Attempting to register without authentication

stdout:

--container="mesos-23febc46-85d6-471b-bdce-a46da8ccdbbc-S23.fba1d308-1c6f-479e-ba17-94133afb71fe" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe" --stop_timeout="0ns"
--container="mesos-23febc46-85d6-471b-bdce-a46da8ccdbbc-S23.fba1d308-1c6f-479e-ba17-94133afb71fe" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/slaves/23febc46-85d6-471b-bdce-a46da8ccdbbc-S23/frameworks/20150925-222237-83912970-5050-1-0001/executors/kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733/runs/fba1d308-1c6f-479e-ba17-94133afb71fe" --stop_timeout="0ns"
Registered docker executor on r105u33.dsra.local
Starting task kafka-mesos-scheduler.f96eb5ae-8d45-11e5-bfe4-46f423552733
2015-11-17 16:12:21,315 [main] INFO  ly.stealth.mesos.kafka.Scheduler$  - Starting Scheduler$:
debug: true, storage: zk:/kafka-mesos
mesos: master=zk://10.105.0.1:2181,10.105.0.3:2181,10.105.0.5:2181,10.105.0.7:2181,10.105.0.9:2181/mesos, user=<default>, principal=<none>, secret=<none>
framework: name=kafka, role=*, timeout=30d
api: http://kafka-mesos-scheduler.marathon.mesos:31700, bind-address: <all>, zk: 10.105.0.1:2181,10.105.0.3:2181,10.105.0.5:2181,10.105.0.7:2181,10.105.0.9:2181, jre: <none>
2015-11-17 16:12:22,108 [main] INFO  org.eclipse.jetty.server.Server  - jetty-9.0.z-SNAPSHOTWrappedArray()
2015-11-17 16:12:22,191 [main] INFO  org.eclipse.jetty.server.handler.ContextHandler  - Started WrappedArray(o.e.j.s.ServletContextHandler@6c6fa0b5{/,null,AVAILABLE})
2015-11-17 16:12:22,318 [main] INFO  org.eclipse.jetty.server.ServerConnector  - Started WrappedArray(ServerConnector@c5f5deb{HTTP/1.1}{0.0.0.0:31700})
2015-11-17 16:12:22,319 [main] INFO  ly.stealth.mesos.kafka.HttpServer$  - started on port 31700
2015-11-17 16:12:53,981 [Jetty-76] INFO  ly.stealth.mesos.kafka.HttpServer$  - handling - http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/stop
2015-11-17 16:12:54,027 [Jetty-76] INFO  ly.stealth.mesos.kafka.HttpServer$  - finished handling
2015-11-17 16:12:56,924 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - handling - http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/stop
2015-11-17 16:12:56,929 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - finished handling
2015-11-17 16:12:58,300 [Jetty-76] INFO  ly.stealth.mesos.kafka.HttpServer$  - handling - http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/stop
2015-11-17 16:12:58,305 [Jetty-76] INFO  ly.stealth.mesos.kafka.HttpServer$  - finished handling
2015-11-17 16:13:06,200 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - handling - http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/list
2015-11-17 16:13:06,210 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - finished handling
2015-11-17 16:13:11,899 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - handling - http://kafka-mesos-scheduler.marathon.mesos:31700/api/broker/start
2015-11-17 16:14:12,372 [Jetty-79] INFO  ly.stealth.mesos.kafka.HttpServer$  - finished handling
aglahe commented 9 years ago

I know there were a few "stops," but was me just making sure th "old" broker was stopped.

dmitrypekar commented 9 years ago

@aglahe These are the out and err of the Scheduler. Could you please provide stdout and stderr of the failed broker?

aglahe commented 9 years ago

that's the problem..i don't even see tasks start up. So not really sure where I can get those...it's strange.

screen shot 2015-11-17 at 11 27 07 am

Thanks for the help btw.

dmitrypekar commented 9 years ago

Ok. The problem seems to be in the fact, that the scheduler doesn't receive offers. This in turn may be influenced by connectivity problem in direction from mesos master -> scheduler.

You can dig in master logs (accessible from web UI logs link) for more details. Also you may try to play with LIBPROCESS_IP settings, pointing to the accessible IP address of the Scheduler.

This problem isn't actually related the framework itself. It is related to some deployment problem.

aglahe commented 9 years ago

Sounds good...thanks for the help..i'll keep digging, and update this when I figure it out.

aglahe commented 9 years ago

Figured it out...i guess you cannot run the scheduler, inside a docker container, in bridge mode..it needs to be HOST mode. I now see brokers' starting up.

$ ./kafka-mesos.sh broker start 0
broker started:
  id: 0
  active: true
  state: running
  resources: cpus:1.00, mem:2048, heap:1024, port:auto
  failover: delay:1m, max-delay:10m
  stickiness: period:10m, hostname:r105u13.dsra.local
  task: 
    id: broker-0-4ffad113-da45-499e-b374-a72100e99478
    state: running
    endpoint: r105u13.dsra.local:31000
dmitrypekar commented 9 years ago

@aglahe Great! Closing the ticket then.