timstclair opened 8 years ago
I'm leaning towards option 1.
Having had the experience we have being a downstream consumer of Kubernetes in OpenShift, I am really scared of what it would entail to keep forks up to date as cadvisor continues to evolve. If we do end up choosing option 2, we would need to have explicit guarantees about API compatibility, so fork maintainers have as few headaches as possible when it comes time to rebase cadvisor into their fork.
+1 for option 1 as well. It should be fairly straightforward to make the existing storage plugin logic a standalone entity.
Polling might introduce time sync issues, but we can hopefully address that with a general-purpose library that all storage plugins can re-use.
Yeah, a general purpose library sounds good. We could implement it so that it uses the exact same StorageDriver interface we already have, and all that would need to be done is build a binary out of it.
I'll put together a PoC. Any objections to using Go's standard net/rpc package over a unix socket?
Why not use the existing REST API? We have enough trouble maintaining that one. We have yet to deprecate the v1 API and move to v2 after completing it.
I think using the existing rest API would require polling. I was envisioning a push model where cAdvisor connects to a "server" and sends an "AddStats" RPC every time AddStats would normally be called on the driver.
Ahh ok! Just as a thought experiment, would it be easier from a maintenance standpoint if we were to handle polling and the issues with polling in a library and not add a new API?
Sorry, I sent that too soon :)
Yes, I share your concerns about the maintenance problem, specifically around version skew. However, I'm not sure that using the REST endpoints makes it that much easier, since it would still be a problem to deprecate v1. What if instead we include a Version rpc call that returns the API version supported? That way the client can handle different versions, and we can eventually deprecate the older ones.
I think the advantages of implementing it this way are:
- Stats are pushed immediately, no lag from polling
- Scraping the REST endpoint would require significantly more CPU & memory, since it needs to marshal & unmarshal the JSON for all containers at each polling interval, and check which stats are new.
- The RPC implementation is actually much simpler, for the above reasons.
The initial server RPCs would be:
type Server interface {
    AddStats(info *v1.ContainerInfo, _ *struct{}) error
    Version(_ struct{}, version *string) error
}
WDYT?
Proto is definitely better than JSON. Do we need an RPC interface, though? If we go down the rpc + protobuf route, you might want to check out https://capnproto.org/
Do we need an RPC interface though?
I don't have much experience with IPC programming. What are the alternatives?
If we go down the rpc + protobuf route
Do we need protobuf? I was thinking of just using Go's built-in rpc library.
I put together the simplest implementation I could in https://github.com/google/cadvisor/pull/1462
This is low priority, but I'd appreciate feedback if you get a chance. I'd like to get this done for cAdvisor v0.25.
Protobuf is going to be harder to support/debug than JSON. I suggest we allow both, so that implementers of plugins can use JSON during development and switch over to protobuf once their plugins are mature.
We also probably need to version this API to support adding new features while maintaining backward compatibility with older plugin versions. If we externalize storage drivers, driver implementations will always evolve at different paces.
sorry, I see that you have addressed versioning already :-)
I was just reading up on Go plugins. There's a proposal for first-class dynamically linked plugins which might (?) go out with Go 1.8. Another option that looks interesting is github.com/natefinch/pie, which uses stdin/stdout + the rpc stdlib + gob encoding for IPC.
I think there are enough nuances to this issue that it deserves a formal design proposal. There are still some open questions about the trade-offs of a local (IPC) vs. remote (push stats over HTTP) approach, and both of those have their own set of challenges.
Another thought that hasn't been discussed yet: I'm wondering if the new storage driver model should have some mechanism for controlling the interval at which AddStats is called that is decoupled from the --housekeeping_interval flag. As a user, is there a scenario where I may want cAdvisor to use dynamic housekeeping but only flush stats to my storage backend every minute, for example?
I achieved this in our storage driver by calculating the number of seconds between intervals and only flushing when a specified amount of time has passed.
Is this something that should be baked into the design, or should we assume users will rely on the --housekeeping_interval flag? If we were to add something like this as an option, it could be a new flag, something like --storage_flush_interval.
WDYT?
+1 for option 1.
A message queue may be a good candidate, e.g. RabbitMQ: message queues are language-independent and perform well with a push model. A lightweight networking library like ZeroMQ, also often described as a message queue, could be an alternative. What's more, using a message queue would give us extra functionality like publish/subscribe and filtering that we could take advantage of.
We could also implement it by defining a custom protocol over TCP, but that takes time and makes users spend more effort implementing the receiver.
Hi, I can see this issue has had almost no activity in the past year. What's currently the best way to add a custom driver?
There are now many frameworks (Telegraf, OpenTelemetry) that can ingest Prometheus metrics and output to storage backends. We should encourage using converters from the Prometheus format, since that endpoint is the best-maintained. We should not accept new storage drivers for the time being. If we decide to eventually remove existing ones, we should ensure there are easy migrations for users.
If we do want push-based metrics, we could consider using the OpenTelemetry collector format.
Besides OpenMetrics there is also an RFC in the making: https://datatracker.ietf.org/doc/draft-richih-opsawg-openmetrics/. @bwplotka can elaborate on this subject, I think.
Thanks for this discussion. I don't want to pollute too much, but isn't that integration problem solved already in kube-state-metrics? 🤗 (indeed in ~OpenMetrics way), Or am I missing something?
cc @lilic @brancz @RichiH @SuperQ @robskillington
Our current model for cAdvisor storage backends involves checking in code that translates the cAdvisor API and pushes out or exposes the stats for each storage backend. The problem with this approach is that we don't have a good story for maintenance of all the storage drivers, and expecting the current cAdvisor maintainers to maintain a driver for every storage solution on the market is unscalable.
For example, we currently have 6 PRs open to add new storage drivers: https://github.com/google/cadvisor/pulls?q=is%3Aopen+is%3Apr+label%3Aarea%2Fstorage
There are a couple of ways we could handle storage drivers if we moved them out of cAdvisor.
Any other ideas for how to deal with this issue?
@google/cadvisor @vishh @dchen1107 @jimmidyson @derekwaynecarr @pmorie @ncdc