mesosphere / universe

The Mesosphere Universe package repository.
http://mesosphere.github.io/universe
Apache License 2.0

Validate presence of framework-name property in config.json when is_framework=true #146

Open ConnorDoyle opened 9 years ago

ConnorDoyle commented 9 years ago

cc @jsancio @BenWhitehead

BenWhitehead commented 9 years ago

I think this is an okay idea, though it may cause an issue for cassandra, which configures its framework name based on the cluster name.

For example the user can specify cluster.name = prod which will result in a framework name in mesos of cassandra.prod
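The naming rule in this example can be sketched as follows. This is only an illustration, assuming the framework name is the literal prefix `cassandra` joined to the configured cluster name with a dot; the function `framework_name` is hypothetical and is not taken from the cassandra framework's code.

```python
def framework_name(cluster_name: str, prefix: str = "cassandra") -> str:
    """Derive the Mesos framework name from the cluster name (assumed rule)."""
    if not cluster_name:
        raise ValueError("cluster name must be non-empty")
    # Assumption: name is "<prefix>.<cluster name>", as in cassandra.prod
    return f"{prefix}.{cluster_name}"


if __name__ == "__main__":
    print(framework_name("prod"))
```

Because the name depends on user configuration, a single fixed framework-name value in the package would not match every installed instance.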

ConnorDoyle commented 9 years ago

Based on how the DCOS UI maps registered frameworks to Marathon processes, there's no way around this (otherwise your service has 0 chance of appearing "healthy" in the UI). Since that's the case, it should be required to avoid massive confusion.
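A minimal sketch of the proposed validation, under stated assumptions: config.json is treated as a JSON-schema-like document whose options live under nested `properties` objects, and the `is_framework` flag is passed in by the caller. The helper names `has_property` and `validate_framework_name`, and the `myservice` section in the usage example, are hypothetical and not part of this repo's tooling.

```python
import json


def has_property(schema: dict, name: str) -> bool:
    """Recursively look for `name` among the `properties` of a schema-like dict."""
    props = schema.get("properties", {})
    if name in props:
        return True
    return any(
        has_property(sub, name) for sub in props.values() if isinstance(sub, dict)
    )


def validate_framework_name(is_framework: bool, config: dict) -> list:
    """Return a list of error messages; an empty list means the check passes."""
    if is_framework and not has_property(config, "framework-name"):
        return [
            "config.json must define a framework-name property "
            "when is_framework=true"
        ]
    return []


if __name__ == "__main__":
    # Hypothetical package config with a nested framework-name option.
    config = json.loads(
        '{"type": "object", "properties": {"myservice": {"type": "object",'
        ' "properties": {"framework-name": {"type": "string"}}}}}'
    )
    print(validate_framework_name(True, config))
    print(validate_framework_name(True, {"type": "object", "properties": {}}))
```

Searching nested sections rather than only the top level is a design choice made for this sketch; the actual check could just as well require a fixed location.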

ConnorDoyle commented 9 years ago

The alternative for the UI is to change how that mapping is done. That's outside the scope of this repo though. cc @mlunoe && @rcorral.

jsancio commented 9 years ago

FYI, we also need framework-name to be there given how uninstall works.

ConnorDoyle commented 9 years ago

Good point.

rcorral commented 9 years ago

Just for the record, if there's a DCOS_PACKAGE_FRAMEWORK_NAME we assume that the marathon app is a framework.
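The assumption described here could be expressed roughly like this. It is a sketch, not the UI's actual code: the app is modelled as a plain dict shaped like a Marathon app definition with a `labels` mapping, and the function names are hypothetical.

```python
FRAMEWORK_LABEL = "DCOS_PACKAGE_FRAMEWORK_NAME"


def framework_name_from_app(app: dict):
    """Return the framework-name label of a Marathon app dict, or None if absent."""
    labels = app.get("labels") or {}
    return labels.get(FRAMEWORK_LABEL)


def is_framework(app: dict) -> bool:
    """Treat an app as a framework whenever the label is present."""
    return framework_name_from_app(app) is not None


if __name__ == "__main__":
    app = {"id": "/cassandra/dcos", "labels": {FRAMEWORK_LABEL: "cassandra.dcos"}}
    print(is_framework(app), framework_name_from_app(app))
```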

ConnorDoyle commented 9 years ago

@rcorral -- understood. That's something that doesn't live in this repo though (it's added as a label to the Marathon app), so the purpose of this required config parameter is to have a uniform way to communicate the value to the client (dcos-cli).

BenWhitehead commented 9 years ago

I'm spinning up a cluster with 3 instances of the cassandra framework running so that we can test some of these things out before I roll it back.

mlunoe commented 9 years ago

(removed)

BenWhitehead commented 9 years ago

In the case of cassandra, I think things are handled rather well.

Install

I created a cluster and installed 3 different instances of the cassandra service.

  1. Default cassandra.dcos
  2. Configured cassandra.cluster-name to be test1
  3. Configured cassandra.cluster-name to be test2

As can be seen in the screenshot below, all three instances were started cleanly by marathon: [image]

Health Checks

It can also be seen that the health check correlation is happening correctly, and was verified using the following script:

for s in dcos test1 test2; do http --print=HhBb --pretty=colors http://benw-19-elasticloa-a0ax8we91ngl-695045825.us-west-2.elb.amazonaws.com/service/cassandra.$s/health/cluster/report;done

Which output:

GET /service/cassandra.dcos/health/cluster/report HTTP/1.1
User-Agent: HTTPie/0.9.0
Accept: */*
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: benw-19-elasticloa-a0ax8we91ngl-695045825.us-west-2.elb.amazonaws.com

HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 03 Jun 2015 00:20:12 GMT
Server: openresty/1.7.10.1
Content-Length: 592
Connection: keep-alive

{"healthy":true,"results":[{"name":"nodeCount","ok":true,"expected":3,"actual":3},{"name":"seedCount","ok":true,"expected":2,"actual":2},{"name":"allHealthy","ok":true,"expected":[true,true,true],"actual":[true,true,true]},{"name":"operatingModeNormal","ok":true,"expected":["NORMAL","NORMAL","NORMAL"],"actual":["NORMAL","NORMAL","NORMAL"]},{"name":"lastHealthCheckNewerThan","ok":true,"expected":[1433290512948,1433290512948,1433290512948],"actual":[1433290763791,1433290765815,1433290765998]},{"name":"nodesHaveServerTask","ok":true,"expected":[true,true,true],"actual":[true,true,true]}]}

GET /service/cassandra.test1/health/cluster/report HTTP/1.1
User-Agent: HTTPie/0.9.0
Accept: */*
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: benw-19-elasticloa-a0ax8we91ngl-695045825.us-west-2.elb.amazonaws.com

HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 03 Jun 2015 00:20:13 GMT
Server: openresty/1.7.10.1
Content-Length: 570
Connection: keep-alive

{"healthy":false,"results":[{"name":"nodeCount","ok":true,"expected":3,"actual":3},{"name":"seedCount","ok":true,"expected":2,"actual":2},{"name":"allHealthy","ok":false,"expected":[true,true,true],"actual":[true,true]},{"name":"operatingModeNormal","ok":false,"expected":["NORMAL","NORMAL","NORMAL"],"actual":["NORMAL","NORMAL"]},{"name":"lastHealthCheckNewerThan","ok":false,"expected":[1433290513143,1433290513143,1433290513143],"actual":[1433290781614,1433290789801]},{"name":"nodesHaveServerTask","ok":false,"expected":[true,true,true],"actual":[true,false,true]}]}

GET /service/cassandra.test2/health/cluster/report HTTP/1.1
Host: benw-19-elasticloa-a0ax8we91ngl-695045825.us-west-2.elb.amazonaws.com
Accept: */*
Connection: keep-alive
User-Agent: HTTPie/0.9.0
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 03 Jun 2015 00:20:13 GMT
Server: openresty/1.7.10.1
Content-Length: 592
Connection: keep-alive

{"healthy":true,"results":[{"name":"nodeCount","ok":true,"expected":3,"actual":3},{"name":"seedCount","ok":true,"expected":2,"actual":2},{"name":"allHealthy","ok":true,"expected":[true,true,true],"actual":[true,true,true]},{"name":"operatingModeNormal","ok":true,"expected":["NORMAL","NORMAL","NORMAL"],"actual":["NORMAL","NORMAL","NORMAL"]},{"name":"lastHealthCheckNewerThan","ok":true,"expected":[1433290513357,1433290513357,1433290513357],"actual":[1433290787673,1433290792903,1433290797275]},{"name":"nodesHaveServerTask","ok":true,"expected":[true,true,true],"actual":[true,true,true]}]}
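To read these reports more easily, the failing checks can be pulled out of the JSON body. The sketch below assumes only the structure visible in the output above (a top-level `healthy` flag and a `results` list whose entries carry `name` and `ok`); the function name `failed_checks` is hypothetical, and `SAMPLE` is an abbreviated subset of the cassandra.test1 response, not the full body.

```python
import json

# Abbreviated subset of the cassandra.test1 report shown above.
SAMPLE = (
    '{"healthy": false, "results": ['
    '{"name": "nodeCount", "ok": true, "expected": 3, "actual": 3},'
    '{"name": "allHealthy", "ok": false,'
    ' "expected": [true, true, true], "actual": [true, true]},'
    '{"name": "nodesHaveServerTask", "ok": false,'
    ' "expected": [true, true, true], "actual": [true, false, true]}]}'
)


def failed_checks(report_json: str) -> list:
    """Return the names of all checks whose `ok` field is not true."""
    report = json.loads(report_json)
    return [r["name"] for r in report.get("results", []) if not r.get("ok", False)]


if __name__ == "__main__":
    print(failed_checks(SAMPLE))
```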

Uninstall

When attempting to uninstall cassandra with:

dcos package uninstall cassandra

I am met with the following output:

Multiple instances of app [cassandra] are installed. Please specify the app id of the instance to uninstall or uninstall all. The app ids of the installed package instances are: [/cassandra/dcos, /cassandra/test1, /cassandra/test2].

Then when I run:

dcos package uninstall --app-id=/cassandra/test2 cassandra

The framework scheduler process in marathon is correctly stopped (the framework isn't shut down yet because we haven't had a new release of the cli with the functionality).
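The disambiguation behaviour observed above can be sketched as follows. This is not the dcos-cli implementation; it is a hypothetical function, `select_app_to_uninstall`, that assumes the list of installed app ids is already known and only mirrors the behaviour seen in the output: refuse when several instances exist and no app id is given.

```python
from typing import List, Optional


def select_app_to_uninstall(
    package: str, installed_app_ids: List[str], app_id: Optional[str] = None
) -> str:
    """Pick the app id to uninstall, or raise if the choice is ambiguous."""
    if not installed_app_ids:
        raise LookupError(f"Package [{package}] is not installed.")
    if app_id is not None:
        if app_id not in installed_app_ids:
            raise LookupError(f"No instance of [{package}] with app id [{app_id}].")
        return app_id
    if len(installed_app_ids) > 1:
        raise ValueError(
            f"Multiple instances of app [{package}] are installed. "
            f"Please specify the app id of the instance to uninstall."
        )
    return installed_app_ids[0]


if __name__ == "__main__":
    ids = ["/cassandra/dcos", "/cassandra/test1", "/cassandra/test2"]
    print(select_app_to_uninstall("cassandra", ids, app_id="/cassandra/test2"))
```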

Summary

I definitely agree that we will need some sort of control in place to make sure that the cli can function well, but I don't know if requiring a framework-name property is the correct way to go. In the case of cassandra, the marathon app id, framework name, executor ID and task ID are all correlated with the cassandra cluster that has been configured, so things work out well. Perhaps there are other frameworks that are less disciplined in how these things are approached, and this is something we should address by building tooling/documentation to guide service implementers.
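As an illustration of that correlation, the app ids and service names seen in this test (`/cassandra/test2` and `cassandra.test2`) suggest a simple mapping between the two. The sketch below assumes that mapping holds in general for cassandra, which this thread does not confirm; the function name is hypothetical, and executor and task IDs are left out because their format is not shown here.

```python
def framework_name_for_app_id(app_id: str) -> str:
    """Map a marathon app id such as /cassandra/test2 to cassandra.test2 (assumed rule)."""
    parts = [p for p in app_id.split("/") if p]
    if not parts:
        raise ValueError("app id must contain at least one path segment")
    # Assumption: path segments of the app id are joined with dots.
    return ".".join(parts)


if __name__ == "__main__":
    for app_id in ["/cassandra/dcos", "/cassandra/test1", "/cassandra/test2"]:
        print(app_id, "->", framework_name_for_app_id(app_id))
```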