gridpocket / ansible-mesos-cluster

4 stars 2 forks source link

you need check marathon log first. #1

Open xiaods opened 9 years ago

xiaods commented 9 years ago

When deploying a task from the marathon UI, the deployment hangs forever if mesos master is not the first one of the list (!). I'm currently digging into this bug.

gridpocketuser commented 9 years ago

I remember not having found anything particuliar in Marathon logs. I'll check once more. Did you manage to have it working ?

xiaods commented 9 years ago

@gridpocketuser I have not use your ansible to reproduce this bug. But I have successful deploy many more times for mesos cluster, so i have came across similar case. in my experience, I remember when you post a job to marathon, the marathon's trigger log is very importance hint for this lang hanging.

so you need carefully recheck the syslog for marathon. if you have not found any hints. we can alternative choice another way to check slave log. suppose you close two slave, left one slave as checkpoint. then you send aging the job to marathon, to check slave log, then you will find the cause to block the slave deployment.

gridpocketuser commented 9 years ago

@xiaods I ran the playbook several time in a row and everything went fine. I'm not able to reproduce the problem anymore. Strange....

lucj commented 9 years ago

@xiaods well, last attempt failed :( From the Mesos dashboard, only a single resource is found instead of 3.

mesos_dashboard_ko
lucj commented 9 years ago

@xiaods Next attempt seems to have installed everything fine (3 slaves in the mesos dashboard) but new tasks are stucked in the deployments state in marathon.

Task creation:

capture d ecran 2015-09-08 a 13 03 41

Task's never ending deployment

capture d ecran 2015-09-08 a 13 07 01

Checking the logs I have,

mesos-master.FATAL:

Log file created at: 2015/09/08 09:59:28
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0908 09:59:28.027618  8585 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Also, in syslog, I have the following:

Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [INFO] [09/08/2015 11:03:47.961] [marathon-akka.actor.default-dispatcher-19]     [akka://marathon/user/MarathonScheduler/$a/DeploymentManager/cff1b0c2-7306-4bb0-bdc1-4f0546dbfafd/$a] Successfully started 0 instances of /testwww
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,904] INFO Compute DeploymentPlan from Group(/,Set(),Set(),Set(),2015-09-08T11:03:22.137Z) to Group(/,Set(AppDefinition(/testwww,Some(python -m SimpleHTTPServer $PORT),None,None,Map(),3,0.1,16.0,0.0,,Set(),List(),List(),List(10000),false,1 second,1.15,3600 seconds,None,Set(),Set(),UpgradeStrategy(1.0,1.0),Map(),None,2015-09-08T11:03:47.893Z)),Set(),Set(),2015-09-08T11:03:47.893Z) (mesosphere.marathon.upgrade.DeploymentPlan$:217)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,905] INFO Computed new deployment plan: DeploymentPlan(2015-09-08T11:03:47.893Z, (Step(Vector(Start(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 0))), Step(Vector(Scale(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 3, None}))))) (mesosphere.marathon.upgrade.DeploymentPlan$:271)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,906] INFO Deploy plan:DeploymentPlan(2015-09-08T11:03:47.893Z, (Step(Vector(Start(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 0))), Step(Vector(Scale(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 3, None}))))) with force:false (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$e4a8e89a:104)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,916] INFO Compute DeploymentPlan from Group(/,Set(),Set(),Set(),2015-09-08T11:03:22.137Z) to Group(/,Set(AppDefinition(/testwww,Some(python -m SimpleHTTPServer $PORT),None,None,Map(),3,0.1,16.0,0.0,,Set(),Vector(),Vector(),Vector(10000),false,1000 milliseconds,1.15,3600000 milliseconds,None,Set(),Set(),UpgradeStrategy(1.0,1.0),Map(),None,2015-09-08T11:03:47.893Z)),Set(),Set(),2015-09-08T11:03:47.893Z) (mesosphere.marathon.upgrade.DeploymentPlan$:217)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,923] INFO Computed new deployment plan: DeploymentPlan(2015-09-08T11:03:47.893Z, (Step(Vector(Start(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 0))), Step(Vector(Scale(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 3, None}))))) (mesosphere.marathon.upgrade.DeploymentPlan$:271)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,936] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-09-08T11:03:47.893Z, (Step(Vector(Start(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 0))), Step(Vector(Scale(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 3, None}))))) (mesosphere.marathon.state.GroupManager:153)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,936] INFO Compute DeploymentPlan from Group(/,Set(),Set(),Set(),2015-09-08T11:03:22.137Z) to Group(/,Set(AppDefinition(/testwww,Some(python -m SimpleHTTPServer $PORT),None,None,Map(),3,0.1,16.0,0.0,,Set(),Vector(),Vector(),Vector(10000),false,1000 milliseconds,1.15,3600000 milliseconds,None,Set(),Set(),UpgradeStrategy(1.0,1.0),Map(),None,2015-09-08T11:03:47.893Z)),Set(),Set(),2015-09-08T11:03:47.893Z) (mesosphere.marathon.upgrade.DeploymentPlan$:217)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:47,938] INFO Computed new deployment plan: DeploymentPlan(2015-09-08T11:03:47.893Z, (Step(Vector(Start(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 0))), Step(Vector(Scale(App(/testwww, Some(python -m SimpleHTTPServer $PORT))), 3, None}))))) (mesosphere.marathon.upgrade.DeploymentPlan$:271)
Sep  8 11:03:47 vagrant-ubuntu-trusty-64 marathon[9321]: [INFO] [09/08/2015 11:03:47.961] [marathon-akka.actor.default-dispatcher-19] [akka://marathon/user/MarathonScheduler/$a/DeploymentManager/cff1b0c2-7306-4bb0-bdc1-4f0546dbfafd/$a] Successfully started 0 instances of /testwww
Sep  8 11:03:51 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:51,455] INFO 192.168.1.107 -  -  [08/Sep/2015:11:03:51 +0000] "GET /v2/apps//testwww HTTP/1.1" 200 678 "http://192.168.1.211:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36" (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$5b95268a:15)
Sep  8 11:03:51 vagrant-ubuntu-trusty-64 marathon[9321]: [2015-09-08 11:03:51,461] INFO 192.168.1.107 -  -  [08/Sep/2015:11:03:51 +0000] "GET /v2/apps//testwww/versions HTTP/1.1" 200 41 "http://192.168.1.211:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36" (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$5b95268a:15)

I cannot really tell what's wrong from those logs.

The strange thing is that this problem does not occur on each new deployment of the cluster. Sometimes the deployment is working totally fine and if I try the whole procedure once again it can fail (or not). Any ideas are more than welcome.