go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

Move top-level domain knowledge to leaf nodes #343

Open gunnihinn opened 6 years ago

gunnihinn commented 6 years ago

One of the ways we try to do less work is to keep track of which backends have which metrics, by top-level domain. For example, if a backend contains metrics in the namespace

foo.bar

we know we don't have to bother asking it for metrics of the form spam.eggs.

This information is currently kept in the root BroadcastGroup node using the pathcache package. I think that's overkill for two reasons:

  1. It makes the BroadcastGroups smarter. They play a "coordinator" role in a scatter/gather architecture, and the dumber they are, the easier everyone's lives are.
  2. The pathcache package does unnecessary work to expire the keys we put in there. What top-level domains a store contains changes so infrequently that the minutely-ish refreshes we do in the background are perfectly fine to keep that information up to date.

Instead I propose we move the knowledge of what top-level domains a store contains to the leaf nodes; the non-BroadcastGroup implementers of the ServerClient interface. They can keep a set (fine, a map[string]struct{}) of top-level domains around, and expose a (common) method that checks whether a list of top-level domains intersects with that set. The BroadcastGroups can then use that to filter the list of leaves when sending out requests.
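
A minimal sketch of what that could look like on a leaf node, purely illustrative (the names tldSet, SetTLDs and ContainsAnyTLD are made up for this example, not existing carbonapi API):

```go
package backend

import "sync"

// tldSet tracks which top-level domains a backend is known to contain.
type tldSet struct {
	mu   sync.RWMutex
	tlds map[string]struct{}
}

// SetTLDs replaces the set of top-level domains for this backend;
// it would be called from the periodic background TLD refresh.
func (s *tldSet) SetTLDs(tlds []string) {
	m := make(map[string]struct{}, len(tlds))
	for _, t := range tlds {
		m[t] = struct{}{}
	}
	s.mu.Lock()
	s.tlds = m
	s.mu.Unlock()
}

// ContainsAnyTLD reports whether any of the given top-level domains is
// present on this backend, so a BroadcastGroup can skip it otherwise.
func (s *tldSet) ContainsAnyTLD(tlds []string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, t := range tlds {
		if _, ok := s.tlds[t]; ok {
			return true
		}
	}
	return false
}
```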

This also has the benefit of simplifying the ProbeTLDs method in the BroadcastGroups, as its job would become to just farm out ProbeTLD requests to the group's children, so we would no longer need to wait for or coordinate responses in it.
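
Again only as an illustration, the coordinator side could then reduce to a simple filter over its children, assuming a ContainsAnyTLD method like the one sketched above (the ServerClient here is trimmed down to just that method):

```go
package backend

// ServerClient is reduced to the single hypothetical method this
// example needs; the real interface in carbonapi has more methods.
type ServerClient interface {
	ContainsAnyTLD(tlds []string) bool
}

// filterByTLD keeps only the clients whose TLD set intersects the
// top-level domains referenced by the query.
func filterByTLD(clients []ServerClient, tlds []string) []ServerClient {
	kept := make([]ServerClient, 0, len(clients))
	for _, c := range clients {
		if c.ContainsAnyTLD(tlds) {
			kept = append(kept, c)
		}
	}
	return kept
}
```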

This issue is on our backlog. It's suitable for first-time contributors if anyone's interested in becoming one.

Civil commented 6 years ago

I'm not sure it's a necessary thing to do, maybe as a low-priority item. It'll mostly move complexity from the broadcast group to the client group.

But I agree that it will make the broadcast group code more readable and maintainable.

P.S. I've also added a label for such cases.

Civil commented 5 years ago

Actually it might not be that easy, as apparently a previous attempt at optimising that path led to a very nasty bug that dramatically degraded performance under high concurrency.

The issue there was that, because of the optimization, the correct limiter holding the slots was no longer used, which led to a deadlock in some cases (a query was waiting for a slot that was occupied by the query itself).

But it's still a good task to reimplement that optimization while being careful with the limiters (e.g. move grabbing the slot into the Fetch function itself).
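
A rough sketch of what "grab the slot in the Fetch function itself" could mean, using a plain buffered-channel semaphore; this is not carbonapi's actual limiter package, just an illustration of acquiring and releasing the slot inside Fetch so an optimized path can't bypass it or wait on a slot it already holds:

```go
package backend

import (
	"context"
	"errors"
)

// slotLimiter is a buffered-channel semaphore standing in for a real
// concurrency limiter; hypothetical, not carbonapi's limiter API.
type slotLimiter chan struct{}

func newSlotLimiter(n int) slotLimiter { return make(slotLimiter, n) }

// enter blocks until a slot is free or the context is cancelled.
func (l slotLimiter) enter(ctx context.Context) error {
	select {
	case l <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// leave releases a previously acquired slot.
func (l slotLimiter) leave() { <-l }

// Backend is a minimal stand-in for a leaf ServerClient.
type Backend struct {
	limiter slotLimiter
}

// Fetch acquires its own slot, so a caller that skips intermediate
// layers (for example after TLD filtering) still goes through the
// limiter, and never waits on a slot it grabbed further up the stack.
func (b *Backend) Fetch(ctx context.Context, query string) ([]byte, error) {
	if err := b.limiter.enter(ctx); err != nil {
		return nil, err
	}
	defer b.limiter.leave()

	// ... perform the actual request against the backend here ...
	return nil, errors.New("fetch body omitted in this sketch")
}
```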