google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Rethink storage driver model #1458

Open timstclair opened 8 years ago

timstclair commented 8 years ago

Our current model for cAdvisor storage backends involves checking in code that translates the cAdvisor API and pushes out or exposes the stats for each storage backend. The problem with this approach is that we don't have a good story for maintenance of all the storage drivers, and expecting the current cAdvisor maintainers to maintain a driver for every storage solution on the market is unscalable.
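For reference, the in-tree contract that each checked-in backend implements is roughly the following (a sketch, not the exact source; the v1 types live in github.com/google/cadvisor/info/v1):

package storage

import info "github.com/google/cadvisor/info/v1"

// Approximate shape of the current model: cAdvisor's housekeeping loop calls
// AddStats for every new sample, and every backend's implementation is
// compiled into the cAdvisor binary.
type StorageDriver interface {
  AddStats(cInfo info.ContainerInfo, stats *info.ContainerStats) error
  Close() error
}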

For example, we currently have 6 PRs open to add new storage drivers: https://github.com/google/cadvisor/pulls?q=is%3Aopen+is%3Apr+label%3Aarea%2Fstorage

There are a couple of ways we could handle storage drivers if we moved them out of cAdvisor:

  1. Implement the drivers as a shim process which gets stats pushes from cAdvisor (e.g. through IPC or a REST endpoint), and takes the appropriate action. This could be implemented as a separate process in the same container, or another container (a "sidecar").
  2. Encourage wrapper implementations or forks of cAdvisor which add the necessary functionality for each driver. We could create an "official" list of forks organized by drivers to make these more discoverable. We could also reorganize the cAdvisor main package to make this easier.

Any other ideas for how to deal with this issue?

@google/cadvisor @vishh @dchen1107 @jimmidyson @derekwaynecarr @pmorie @ncdc

ncdc commented 8 years ago

I'm leaning towards option 1.

Having had the experience of being a downstream consumer of Kubernetes in OpenShift, I am really scared of what it would entail to keep forks up to date as cAdvisor continues to evolve. If we do end up choosing option 2, we would need explicit guarantees about API compatibility, so that fork maintainers have as few headaches as possible when it comes time to rebase cAdvisor into their fork.

vishh commented 8 years ago

+1 for 1 as well. It should be fairly straightforward to make the existing storage plugin logic a standalone entity. Polling might introduce time-sync issues, but we can hopefully address that with a general-purpose library that all storage plugins can re-use.
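A rough sketch of what such a shared polling helper might look like (all names here are hypothetical, not an existing cAdvisor package): it polls a stats source on a fixed interval and forwards only samples newer than the last timestamp seen per container, which is the main time-sync problem polling introduces.

package pollhelper

import "time"

// Sample stands in for a single container stats point; a real library would
// use cAdvisor's v1 types here.
type Sample struct {
  Container string
  Timestamp time.Time
}

// Poller repeatedly fetches samples and hands each one to sink exactly once.
type Poller struct {
  fetch    func() ([]Sample, error) // e.g. a call to cAdvisor's REST API
  sink     func(Sample) error       // the storage plugin's write path
  lastSeen map[string]time.Time
}

func New(fetch func() ([]Sample, error), sink func(Sample) error) *Poller {
  return &Poller{fetch: fetch, sink: sink, lastSeen: map[string]time.Time{}}
}

// Run polls until stop is closed, skipping samples that were already delivered.
func (p *Poller) Run(interval time.Duration, stop <-chan struct{}) {
  ticker := time.NewTicker(interval)
  defer ticker.Stop()
  for {
    select {
    case <-stop:
      return
    case <-ticker.C:
      samples, err := p.fetch()
      if err != nil {
        continue // a real library would log or surface the error
      }
      for _, s := range samples {
        if !s.Timestamp.After(p.lastSeen[s.Container]) {
          continue // duplicate of an already-delivered sample
        }
        p.lastSeen[s.Container] = s.Timestamp
        _ = p.sink(s)
      }
    }
  }
}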

timstclair commented 8 years ago

Yeah, a general purpose library sounds good. We could implement it so that it uses the exact same StorageDriver interface we already have, and all that would need to be done is build a binary out of it.

timstclair commented 8 years ago

I'll put together a PoC. Any objections to using go's standard net/rpc package over a unix socket?
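For concreteness, a minimal sketch of the plugin ("server") side with the standard library, assuming a unix socket path and service name that are purely illustrative (this is not the PoC itself):

package main

import (
  "log"
  "net"
  "net/rpc"
)

// StoragePlugin is the hypothetical plugin-side service; cAdvisor would dial
// the socket and invoke AddStats for each new sample.
type StoragePlugin struct{}

// StatsArgs is a placeholder; the real argument would carry the v1.ContainerInfo payload.
type StatsArgs struct {
  ContainerName string
}

func (p *StoragePlugin) AddStats(args StatsArgs, _ *struct{}) error {
  log.Printf("received stats for %s", args.ContainerName)
  return nil // write to the storage backend here
}

func main() {
  if err := rpc.RegisterName("StoragePlugin", new(StoragePlugin)); err != nil {
    log.Fatal(err)
  }
  l, err := net.Listen("unix", "/var/run/cadvisor-plugin.sock")
  if err != nil {
    log.Fatal(err)
  }
  defer l.Close()
  rpc.Accept(l) // serve gob-encoded RPCs until the listener closes
}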

vishh commented 8 years ago

Why not use the existing REST API? We have enough trouble maintaining that one. We have yet to deprecate the v1 API and move to v2 after completing it.

timstclair commented 8 years ago

I think using the existing rest API would require polling. I was envisioning a push model where cAdvisor connects to a "server" and sends an "AddStats" RPC every time AddStats would normally be called on the driver.
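On the cAdvisor side this would look roughly like the following (a sketch only; the socket path, service name, and argument type match the hypothetical plugin sketch above):

package main

import (
  "log"
  "net/rpc"
)

// StatsArgs mirrors the plugin's argument type in the sketch above.
type StatsArgs struct {
  ContainerName string
}

func main() {
  // cAdvisor would dial once at startup and reuse the connection.
  client, err := rpc.Dial("unix", "/var/run/cadvisor-plugin.sock")
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // Called wherever AddStats is invoked on an in-tree driver today.
  var reply struct{}
  err = client.Call("StoragePlugin.AddStats", StatsArgs{ContainerName: "/docker/abc123"}, &reply)
  if err != nil {
    log.Fatal(err)
  }
}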

vishh commented 8 years ago

Ahh ok! Just as a thought experiment, would it be easier from a maintenance standpoint if we were to handle polling and the issues with polling in a library and not add a new API?

timstclair commented 8 years ago

Sorry, I sent that too soon :)

Yes, I share your concerns about the maintenance problem, specifically around version skew. However, I'm not sure that using the REST endpoints makes it that much easier, since it would still be a problem to deprecate v1. What if instead we include a Version rpc call that returns the API version supported? That way the client can handle different versions, and we can eventually deprecate the older ones.

I think the advantages of implementing it this way are:

  • Stats are pushed immediately, with no lag from polling.
  • Scraping the REST endpoint would require significantly more CPU & memory, since it needs to marshal & unmarshal the JSON for all containers at each polling interval and check which stats are new.
  • The RPC implementation is actually much simpler, for the above reasons.

The initial server RPCs would be:

type Server interface {
  // AddStats is called whenever AddStats would be called on an in-tree driver.
  AddStats(info *v1.ContainerInfo, _ *struct{}) error
  // Version returns the plugin API version the server supports.
  Version(_ struct{}, version *string) error
}

WDYT?

vishh commented 8 years ago

proto is definitely better than json. Do we need an RPC interface though? If we go down the rpc + protobuf route you might want to check out https://capnproto.org/

timstclair commented 8 years ago

Do we need an RPC interface though?

I don't have much experience with IPC programming. What are the alternatives?

If we go down the rpc + protobuf route

Do we need protobuf? I was thinking of just using go's built-in rpc library.

timstclair commented 7 years ago

I put together the simplest implementation I could in https://github.com/google/cadvisor/pull/1462. This is low priority, but I'd appreciate feedback if you get a chance. I'd like to get this done for cAdvisor v0.25.

pmorie commented 7 years ago

protobuf is going to be harder to support/debug than json. I suggest we allow both, so that implementers of plugins can use json during development and switch over to protobuf once their plugins are mature.
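If the transport ends up being go's net/rpc, the closest standard-library version of this idea is a switch between the default gob codec and JSON-RPC (net/rpc/jsonrpc), rather than protobuf vs json; a sketch, with registration of the service omitted and names hypothetical:

package main

import (
  "flag"
  "log"
  "net"
  "net/rpc"
  "net/rpc/jsonrpc"
)

func main() {
  useJSON := flag.Bool("json", false, "serve JSON-RPC (human-readable) instead of gob")
  flag.Parse()

  // rpc.RegisterName("StoragePlugin", ...) would go here.
  l, err := net.Listen("unix", "/var/run/cadvisor-plugin.sock")
  if err != nil {
    log.Fatal(err)
  }
  defer l.Close()

  for {
    conn, err := l.Accept()
    if err != nil {
      log.Fatal(err)
    }
    if *useJSON {
      go jsonrpc.ServeConn(conn) // easy to inspect during plugin development
    } else {
      go rpc.ServeConn(conn) // default binary (gob) encoding
    }
  }
}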

pmorie commented 7 years ago

We probably also need to version this API, to support adding new features while maintaining backward compatibility with older plugin versions. If we externalize storage drivers, driver implementations will always evolve at different paces.

pmorie commented 7 years ago

sorry, I see that you have addressed versioning already :-)

timstclair commented 7 years ago

I was just reading up on Go plugins. There's a proposal for first-class dynamically linked plugins here, which might (?) go out with go1.8. Another option that looks interesting is github.com/natefinch/pie, which uses stdin/stdout + the rpc stdlib + gob encoding for IPC.
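For reference, the dynamic-linking route (the proposal above did ship as the plugin package in Go 1.8) would look something like this on the cAdvisor side; the .so path and constructor symbol are hypothetical:

package main

import (
  "log"
  "plugin"
)

func main() {
  // Load a driver built with `go build -buildmode=plugin`.
  p, err := plugin.Open("/usr/lib/cadvisor/mydriver.so")
  if err != nil {
    log.Fatal(err)
  }
  // Look up a hypothetical constructor symbol exported by the driver.
  sym, err := p.Lookup("NewStorageDriver")
  if err != nil {
    log.Fatal(err)
  }
  newDriver, ok := sym.(func() (interface{}, error))
  if !ok {
    log.Fatal("unexpected constructor signature")
  }
  driver, err := newDriver()
  if err != nil {
    log.Fatal(err)
  }
  _ = driver // hand the driver to the housekeeping loop here
}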

timstclair commented 7 years ago

I think there are enough nuances to this issue that it deserves a formal design proposal. There are still some open questions about the trade-offs of a local approach (IPC) vs. a remote one (pushing stats over HTTP), and both have their own set of challenges.

ezeev commented 7 years ago

Another thought that hasn't been discussed yet: I'm wondering whether the new storage driver model should have some mechanism for controlling the interval at which AddStats is called, decoupled from the --housekeeping_interval flag. As a user, is there a scenario where I may want cAdvisor to use dynamic housekeeping but only flush stats to my storage backend every minute, for example?

I achieved this in our storage driver by calculating the number of seconds between intervals and only flushing when a specified amount of time has passed.

Is this something that should be baked into the design, or should we assume users will use the --housekeeping_interval flag? If we were to add something like this as an option, it could be a new flag, something like --storage_flush_interval.

WDYT?
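A sketch of the approach described above: keep accepting AddStats calls at the housekeeping rate, but only flush to the backend once a configured interval has elapsed (the flag name and types here are hypothetical).

package mydriver

import (
  "sync"
  "time"
)

// bufferedDriver accepts stats at the housekeeping rate but only writes to
// the backend once flushInterval has elapsed since the previous flush.
type bufferedDriver struct {
  mu            sync.Mutex
  pending       []interface{} // the real element type would be cAdvisor's stats struct
  lastFlush     time.Time
  flushInterval time.Duration // e.g. set from a --storage_flush_interval flag
}

func (d *bufferedDriver) AddStats(stats interface{}) error {
  d.mu.Lock()
  defer d.mu.Unlock()
  d.pending = append(d.pending, stats)
  if time.Since(d.lastFlush) < d.flushInterval {
    return nil // not time to flush yet
  }
  batch := d.pending
  d.pending = nil
  d.lastFlush = time.Now()
  return d.flush(batch)
}

func (d *bufferedDriver) flush(batch []interface{}) error {
  // write the buffered batch to the storage backend here
  _ = batch
  return nil
}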

oopschen commented 7 years ago

+1 for option 1.

A message queue such as RabbitMQ may be a good candidate: it is language independent and performs well with a push model. A lightweight networking library like ZeroMQ, which is also often described as a message queue, could be an alternative. What's more, a message queue brings extra capabilities such as publish/subscribe and filtering that we could take advantage of.

We could also implement a custom protocol over TCP, but that takes time and makes users spend more effort implementing the receiver.

ujovlado commented 6 years ago

Hi, I can see this issue has had almost no activity in the past year. What's currently the best way to add a custom driver?

dashpole commented 4 years ago

There are now many frameworks (Telegraf, OpenTelemetry) that can ingest Prometheus metrics and output to storage backends. We should encourage using converters from the Prometheus format, since that endpoint is the best-maintained. We should not accept new storage drivers for the time being. If we decide to eventually remove the existing ones, we should ensure there are easy migrations for users.

If we do want push-based metrics, we could consider using the OpenTelemetry collector format.

iwankgb commented 3 years ago

Besides OpenMetrics there is also an RFC in the making: https://datatracker.ietf.org/doc/draft-richih-opsawg-openmetrics/. @bwplotka can elaborate on this subject, I think.

bwplotka commented 3 years ago

Thanks for this discussion. I don't want to pollute too much, but isn't that integration problem already solved by kube-state-metrics? 🤗 (indeed, in an ~OpenMetrics way). Or am I missing something?

cc @lilic @brancz @RichiH @SuperQ @robskillington