kubernetes-retired / contrib

[EOL] This is a place for various components in the Kubernetes ecosystem that aren't part of the Kubernetes core.
Apache License 2.0

Submit Queue / munger: keep state over restarts #1042

Closed lavalamp closed 8 years ago

lavalamp commented 8 years ago

Could be as simple as storing a static file in GCS or as complex as adding a database to the things we run in the utility cluster.

P0 Requirements:

lavalamp commented 8 years ago

@eparis @mikedanese @fejta maybe we can come up with a list of things we want to save.

@mhrgoog It'd be great if we could collect very explicitly in one place the set of things that are getting persisted.

ghost commented 8 years ago

Right now I am brainstorming and throwing ideas against the wall. I am not sure if this justifies a design doc or not. Here are three possibilities:

1) A protobuf stored in a file. Pros: provides backwards compatibility and parsing. Cons: it's just a protobuf; one must read in the whole structure before it can be used, and concurrent operations may not be easy.

2) A key-value store. Pros: concurrent use should be easier. Cons: not sure which one to pick or how robust they are; maybe this is obvious to veterans of the team.

3) A full-on relational DB. Pros: fun with queries and the ability to mine information flexibly. Cons: schema management can be a hassle.
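Option 1's general shape (the whole state serialized as a single blob in a file or GCS object) can be sketched without protobuf at all; a minimal sketch using JSON, where the file name and state fields are entirely hypothetical:

```python
import json
import os
import tempfile

STATE_FILE = "submit-queue-state.json"  # hypothetical; could just as well be a GCS object

def save_state(state, path=STATE_FILE):
    """Write the whole state blob atomically: temp file, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic, so a crash never leaves a half-written file

def load_state(path=STATE_FILE):
    """Read the whole structure back; start empty if nothing was persisted yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

state = {"merges_today": 17, "queue": [31337, 31338]}  # made-up fields
save_state(state)
assert load_state() == state
```

This illustrates the stated con as well: the whole structure is read and written at once, so concurrent writers would need external coordination.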

I think even if we knew what we wanted to save now, the list would change. But knowing the size of the data we want to store would help a lot.

@fejta you have any ideas of what you are looking for?

apelisse commented 8 years ago

What problem are we trying to fix here?

fejta commented 8 years ago

@mhrgoog The data in http://submit-queue.k8s.io/#/e2e is great... except it disappears whenever we restart the merge queue. I want it serialized to a GCS object so we can avoid that.

Specifically I need to be able to measure the following each week:

  • What percentage of the time was the merge bot healthy last week?
  • Which job was the most unhealthy last week?
  • How many things did it merge in the past day?

Right now I cannot, because whenever we restart the mergebot everything is lost. So I wind up tracking these values since 4 hours ago instead.

At this point I am not concerned about fun with queries or concurrency. I want to be able to answer those three questions.

lavalamp commented 8 years ago

I recommend a JSON object over a protobuf.
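Once the history survives restarts, the three questions reduce to simple scans over persisted, timestamped records. A minimal sketch, where the record shapes, sample data, and job names are all made up:

```python
from datetime import datetime, timedelta

# Hypothetical persisted records: hourly health samples, per-job failure
# counts, and merge timestamps, as they might look after deserialization.
now = datetime(2016, 5, 24, 12, 0)
health_samples = [(now - timedelta(hours=h), h % 5 != 0) for h in range(168)]  # one sample per hour, last week
job_failures = {"kubernetes-e2e-gce": 12, "kubernetes-e2e-gke": 30}            # failures per job, last week
merges = [now - timedelta(hours=h) for h in (1, 3, 20, 30)]                    # merge timestamps

# 1) What percentage of the time was the merge bot healthy last week?
healthy_pct = 100.0 * sum(ok for _, ok in health_samples) / len(health_samples)

# 2) Which job was the most unhealthy last week?
worst_job = max(job_failures, key=job_failures.get)

# 3) How many things did it merge in the past day?
merged_today = sum(t > now - timedelta(days=1) for t in merges)

print(healthy_pct, worst_job, merged_today)
```

The point is that none of this needs a query engine: a serialized list of events plus a restart-surviving blob is enough to answer all three questions.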

eparis commented 8 years ago

I'm hearing about saving and analyzing lists of time-series data. Neither is something I want the mungegithub tools to get much better at. I'd rather mungegithub remain focused on GitHub and automating stuff around GitHub, and that we build 'something else' to handle the data/visualization aspects we have been adding.

Should we be dumping (or having something poll) the e2e test data and the history data into something like InfluxDB, which is actually designed to hold this kind of data and, I think, has easy tooling for visualization and exploration?
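For reference, InfluxDB ingests points in its line protocol (measurement, comma-separated tags, comma-separated fields, optional nanosecond timestamp), so a poller would only need to build strings like the one below. The measurement, tag, and field names here are hypothetical:

```python
def influx_line(measurement, tags, fields, ts_ns):
    """Build one InfluxDB line-protocol point: measurement,tag=v field=v timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Integer fields carry an 'i' suffix in line protocol; floats do not.
    field_str = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}" for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = influx_line("submit_queue", {"job": "kubernetes-e2e-gce"},
                   {"healthy": 1, "queue_len": 42}, 1464048000000000000)
print(line)
# submit_queue,job=kubernetes-e2e-gce healthy=1i,queue_len=42i 1464048000000000000
```

Each such line would be POSTed to InfluxDB's write endpoint; the database then handles retention and time-window queries for the visualization side.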

I'm the one who started the process of showing stats in the submit-queue, but as we want more, we're probably best off asking what the right solution is, not what continues to be 'easy' to bolt onto the side...

Especially since, in my mind, the thing that would be GREAT to save across reboots is the proxied cache of GitHub object state, so we don't have such a slow restart and we don't run out of API tokens on restart...

lavalamp commented 8 years ago

I think @eparis is probably right: the best thing would be to publish metrics (I think @apelisse or @rmmh already started publishing via Prometheus?) and scrape them regularly so we can get time-series data for this stuff.

...however, I'm super interested in getting something working yesterday. I'm OK with different short- and long-term solutions.
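Publishing metrics for scraping boils down to serving the Prometheus text exposition format over HTTP. The real client libraries handle this, but the format itself is simple; a stdlib-only sketch with hypothetical metric names:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"submit_queue_merges_total": 17, "submit_queue_healthy": 1}  # hypothetical values

def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 9090), MetricsHandler).serve_forever()  # Prometheus would scrape this endpoint
```

With this split, the munger only exposes current values; the scraper owns history, which is exactly what makes restarts a non-event for the time-series data.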

eparis commented 8 years ago

@lavalamp no fights from me.

lavalamp commented 8 years ago

Remembering the queue order across restarts would be super handy, too. Right now it's chugging away on an unimportant PR instead of 1.3 PRs.

lavalamp commented 8 years ago

I don't think we have an immediate need here anymore. We keep the GitHub cache, and the stat-scraping script handles queue restarts. Closing.