thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
12.73k stars 2.04k forks source link

gRPC: Isolate Info to separate gRPC API so we can reuse discovery for different RPCs: Store, Rules and Targets #2600

Open bwplotka opened 4 years ago

bwplotka commented 4 years ago

Something decided on our public meeting: https://docs.google.com/document/d/165L9DVKHW0sS2PfKYsUCDikwq5XngKkCqTOVjzC6qQE/edit#heading=h.7dcto2wgvh4q

Something to do in later step. This will avoid us to have all those weird flags like separate SD, separate store etc. We can disable certain APIs if needed on API level (e.g ruler can disable RulesAPI if needed) (:

cc @s-urbaniak

povilasv commented 4 years ago

I think I would vote for new flag and StoreInfo endpoint returning types supported, like previous proposal mentioned.

Plus on the sidecar you could do --rule-api=disabled , --store-api=disabled or smth similiar.

Thoughts? Maybe we can write a mini proposal for this?

I'm also thinking that potentially we could do TargetsAPI which would show leaf node Targets scraped (I mean Prometheus /targets endpoint) and dedup them? Thoughts?

bwplotka commented 4 years ago

I think I would vote for new flag

New flag? what flag?

and StoreInfo endpoint returning types supported, like previous proposal mentioned.

Hm, yea we wanted to different info per type so its kind of equivalent. So we can detect things either by grpc reflection or info... what would be better? :thinking:

bwplotka commented 4 years ago

Also for target API there is issue already: https://github.com/thanos-io/thanos/issues/1375

bwplotka commented 4 years ago

Ok we discussed offline that it would be amazing to actually split discovery as separate gRPC API.

I think we can do it after Rules API is landed, but overall idea is that Info endpoint is shared and Querier will only wait for --endpoint flag or something which will then ask for info API and THEN discover what APIs are there (: This will ensure things like https://github.com/thanos-io/thanos/pull/2407#discussion_r414743260 and per store TLS will be much easier to implement.

cc @brancz @s-urbaniak @kakkoyun @squat @povilasv @GiedriusS

povilasv commented 4 years ago

:+1: I like --endpoint

bwplotka commented 4 years ago

Changed title.

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 4 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗

GiedriusS commented 4 years ago

Still valid.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗

bwplotka commented 3 years ago

We need that! (:

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗

yeya24 commented 3 years ago

Still valid.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

bwplotka commented 3 years ago

FYI: This assumes Querier is the one that federates "stuff" (:

lilic commented 3 years ago

I would be interested in tackling this as due to a possible bug we had recently I noticed that rules are tightly coupled to stores so when we try to resolve rules but stores fail sadly nothing gets served.

But first, let me try to explain if I understand correctly what we want to achieve: This issue wants to decouple serving stores and rules and in the future targets when they are added. By passing an --endpoint it will auto-discover what type that address is, e.g. rules or stores API, and discover them and use those API sets separately. We would also have a generic lets call it "endpointset" or "infoset" which does this. The endpointset would have a GetRuleAPIs/GetStoreAPIs which would replace the current storeset.Get and storeset.GetRulesClients. With this we could also decouple the rules away from stores, so both could be discovered and served independently.

I believe also having support for --store flag makes sense, and using that as the default endpoint to discover all stores there, would make sense. And since --rule is not exposed we could deprecate this straight away?

Let me know if anything is wrong or doesn't make sense, thanks!

bwplotka commented 3 years ago

By passing an --endpoint it will auto-discover what type that address is, e.g. rules or stores API, and discover them and use those API sets separately.

Yes, but with a note that endpoints can serve multiple APIs

We would also have a generic lets call it "endpointset" or "infoset" which does this. The endpointset would have a GetRuleAPIs/GetStoreAPIs which would replace the current storeset.Get and storeset.GetRulesClients. With this we could also decouple the rules away from stores, so both could be discovered and served independently.

hmmmm, I wouldn't use word independently. If the endpoint serves both Rules and Store API, those will be discovered (and probed for healthyness) in the same flow. Essentially we will do it per endpoint -no matter how many and what kind of APIs it serves. From endpointset it has to serve Info

bwplotka commented 3 years ago

I believe also having support for --store flag makes sense, and using that as the default endpoint to discover all stores there, would make sense. And since --rule is not exposed we could deprecate this straight away?

Yes, but I would deprecate store as well long term in favor of endpoints. WDYT? also note there is store.sd-file option as well allowing FILE SD like discovery - we will need to do that for endpoints as well :hugs:

lilic commented 3 years ago

hmmmm, I wouldn't use word independently. If the endpoint serves both Rules and Store API, those will be discovered (and probed for healthyness) in the same flow. Essentially we will do it per endpoint -no matter how many and what kind of APIs it serves. From endpointset it has to serve Info

yes, I think we are on the same page, I agree decoupling is not exactly the right word for this.

Yes, but I would deprecate store as well long term in favor of endpoints. WDYT?

sgtm

Okay, I will get started on this tomorrow, busy day of meetings today. Thanks for the feedback!

lilic commented 3 years ago

@bwplotka curious about this point from earlier in this issue:

So we can detect things either by grpc reflection or info... what would be better? 🤔 Ok we discussed offline that it would be amazing to actually split discovery as separate gRPC API.

Curious, did some deep diving into the code and curious how come you all moved away from using the gRPC reflection? I did some experimenting and I can discover what APIs the endpoint has. So for example in my little experiment, I query via the gRPC reflection services and when trying to discover in thanos query I can see my sidecar has the following services/APIs (e.g. Rules and Store):

grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Rules
thanos.Store

Would this not be enough info to discover the passed --endpoint APIs? Why was the decision towards moving to a separate gRPC info API? Thanks!!

bwplotka commented 3 years ago

Hm I might think separate InfoAPI might be orthogonal to endpoint discovery.

We still need to discover types and we can use reflection yes - we added that recently to our gRPC servers 🤗 We could use health endpoint for health( probably we should even)

The problem is with metadata. They are propagated through Info like: time scope, ext labels, source and potentially others in future. But ofc we are happy for amazing proposition that will improve things here! (:

lilic commented 3 years ago

We could use health endpoint for health( probably we should even)

Agreed, was thinking of that as well!

The problem is with metadata. They are propagated through Info like: time scope, ext labels, source and potentially others in future.

For the rules in query at least, we just need the external labels yes, in the info, correct? (At least for now.)

The main thing I am even bringing it up is because if we decide to create a new info API, we would still need to fallback to the one as part of the store. Migration would not work otherwise, or would not be as smooth, you could lose connection to existing components if discovery does not work.

I can see we have a few options, that I can think of: 1.Perform initial discovery via reflection and after that getInfo metadata out of Store API, if both Rules and Store are present.

  1. Add info/medatada endpoint to the rules API, same as right now Store has Info (and potentially to other APIs in the future ,e.g. targets API), and perform discovery via reflection initially, the info in the Rules would expose extra metadata about external labels for example. This would remove the dependency of store from the rules API.
  2. Stick to the initial plan around adding a new Info API which would be the same as the current info under Store API. This would have all the external metadata, but the problem with this is we still need to fallback to same Store Info metadata, due to the migration problems mentioned above.

Personally I like 2 the most, as it treats its API completely separately, but up to you all as you know more about this, happy to go forward with whatever solution! 🎉

Follow up questions: Do we ever expect rules API to be there, but not store API? What guarantees do we give to the users, when migrating between versions?

Thanks for the answers and reading! 🤗

brancz commented 3 years ago

Come to think of it, the scalable rule storage proposal will cause us to have Rule APIs without a Store API, so I also think 2 is the best choice, so we can treat them entirely separately. Taking this a bit further, did we even truly need external labels in the rule deduplication? Wouldn't any identifier that uniquely identifies a Rule API be sufficient (if so, it appears to me that the IP/address of the Rule API would be enough, thus no Info call necessary at all at this point)? It's entirely possible that I'm missing something, but it appears to me that we had the external labels, and just used that as our hammer.

In any case, option 2 is what I feel is the best.

bwplotka commented 3 years ago

Hm. I would go for 4 (unless one of those is what is 4 to me):

Why not:

  1. Splitting InfoAPI from StoreAPI to separate rpc Info proto service
  2. Making sure exactly the same Info is implemented by all Store,Rule,Target,Metadata,Examplar servers once This means that existing StoreAPIs implementations will work automatically
  3. Have specific APIs data in store StoreInfo fields of InfoResponse if separation is needed?

This allows having separation, clear mechanism, clean code, and backward & forward compatibility, no?

I might be missing why this direction will not work?

lilic commented 3 years ago

@bwplotka I think 4. is 1. :) But that would mean they all should have the same metadata in the Info API, or at least some common ones? But from what I understand not all have the same metadata, and judging by @brancz comment when rules API is used separately from store API, what would that metadata would the info API expose?

In case we go with this approach, what should be required in that Info service, which metadata and what would be optional?

bwplotka commented 3 years ago

Well 1 means only store would have info - to me all of them would need some sort of metadata propagation sooner or later. To me unified Info endpoints means just clarity and less complexity since we have unified metadata propagation. Additionally better flexibility for future. What's the benefit of having different logic fir each kind of mix of APIs?

But let's try to "guess" what we might have:

So StoreAPI is clear:

RulesAPI:

MetadataAPI?

Exemplars

By just yolo listing all APIs I see the commonality here , that's why instinct says to me that Info should be single one, reused across all APIs.

bwplotka commented 3 years ago

The extra benefit is mixed APIs e.g StoreAPI and Exemplars we can then have reused LabelSet, Min Max, StoreType (I think it makes sense to reuse those), plus only one propagation communication channel not two.

lilic commented 3 years ago

Nice, thanks for the details! I guess having two metadata that are required/common throughout all the info APIs, makes sense, as we have the StoreType and LabelSets. I am guessing the StoreType would now be ComponentType instead?

Also for migration purposes, we will fallback to a different discovery if no Info API exists, as to ensure a smooth migration?

bwplotka commented 3 years ago

Also for migration purposes, we will fallback to a different discovery if no Info API exists, as to ensure a smooth migration?

In what sense? Currently every component that is used by querier have InfoAPI implemented and served.

lilic commented 3 years ago

@bwplotka hmm, I meant more like during upgrade path, old thanos component that has not been upgraded (e.g. sidecar) would not have the new Info API, and we would need to fallback to what we currently do, in case of query that is using info from store. But not sure I follow what you mean by has info API, rules does not, or am I wrong here? 🤔

brancz commented 3 years ago

I think I made the wrong call earlier, I think maybe it is time for a design doc. It's getting rather hard to follow what's suggestion, and what's proposed at this point.

FWIW if I understand correctly, there is an Info API in the room that any component would expose. I don't really like this idea as it doesn't make sense for a component that only serves the Rule API for example to expose an API that seemingly returns min/max time. This feels like putting things into the same API because of accidental overlap. Feels to me like the rule of "prefer duplication over the wrong/premature abstraction".

lilic commented 3 years ago

I think I made the wrong call earlier, I think maybe it is time for a design doc. It's getting rather hard to follow what's suggestion, and what's proposed at this point.

Fine by me, do you want me to create one? If yes, which proposed solution, or should I outline both of them?

bwplotka commented 3 years ago

@lilic in design doc (we do them on Github), please check ideally, we can have the proposed one chosen by you and other in ## alternatives section.

This feels like putting things into the same API because of accidental overlap. Feels to me like the rule of "prefer duplication over the wrong/premature abstraction".

I don't think premature abstraction is when we already know external labels are used right now in Rules API and we have concrete example of need of those in first iterations of Metadata and Exemplaers.

bwplotka commented 3 years ago

I would call it premature duplication of effort and introducing complexity of migrations all over the place.

bwplotka commented 3 years ago

It has to be one propagation pipeline, can't see other way.

Nothing bad would be to have a similar technique as we have with hints: https://github.com/thanos-io/thanos/blob/c73143d131953fe8daa7975f3ba3dfd7a9785073/pkg/store/storepb/rpc.proto#L109 or something.

Looks like a need to for design yes (:

bwplotka commented 3 years ago

hmm, I meant more like during upgrade path, old thanos component that has not been upgraded (e.g. sidecar) would not have the new Info API,

hmm, yea my idea was that even if you separate InfoMethod to separate gRPC service, this does not break existing StoreAPIs as long as we keep same proto format of message. But it might not work like that, worth to check.

brancz commented 3 years ago

Let's discuss it on a design doc. We're not that far off from each other. I was not talking about label-sets, in my last comment I agreed that label-sets are in common and that there could be a common API for it. Min/max time is an example that I'm talking about that does not universally apply.

bwplotka commented 3 years ago

Looks like design doc is merged, so help wanted to implement this 🤗

lilic commented 3 years ago

I can get started in roughly a week or two. If no one starts that by then I can take it. Does that sound good?

bwplotka commented 3 years ago

hitanshu-mehta commented 3 years ago

Hi @lilic! Have you started working on this? Because we are also planning to work on this proposal during my LFX mentorship 🙂 //cc @prmsrswt @squat

lilic commented 3 years ago

Feel free to take it! If you have any questions around the proposal feel free to reach out!

bill3tt commented 3 years ago

@hitanshu-mehta are you currently implementing this proposal? If so let's add you as the assignee of this ticket so it is 'official' :)

hitanshu-mehta commented 3 years ago

@ianbillett Yes, I have started implementing the proposal. Here is the draft PR :)

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 2 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗

matej-g commented 2 years ago

Just to keep tabs on this, since now https://github.com/thanos-io/thanos/pull/4282 has been merged, what's still missing? Some points:

cc @hitanshu-mehta - as I said if you'd like a hand with these, happy to help