trickstercache / trickster

Open Source HTTP Reverse Proxy Cache and Time Series Dashboard Accelerator
https://trickstercache.org
Apache License 2.0

[Question] Use with Prometheus High Availability? #33

Closed: geekdave closed this issue 3 years ago

geekdave commented 6 years ago

I'm super excited about this project! Thanks for sharing it with the community!

I had a question about this part of the docs:

In a Multi-Origin placement, you have one dashboard endpoint, one Trickster endpoint, and multiple Prometheus endpoints. Trickster is aware of each Prometheus endpoint and treats them as unique databases to which it proxies and caches data independently of each other.

Could this work for load balancing multiple Prometheus servers in an HA setup? We currently have a pair of Prometheus servers in each region, redundantly scraping the same targets. Right now our Grafana is just pinned to one Prometheus server in each region, meaning that if that one goes down, our dashboards go down until we manually change the datasource to point to the other one (and by that point we would have just restored the first server anyway). It's kind of a bummer, because it means that while HA works great for alerting itself, it doesn't work for dashboards.

Would be awesome if there was a way to achieve this with Trickster!

jranson commented 6 years ago

@geekdave Hey Dave - I'm liking your enthusiasm! You are about the 20th person or organization to ask for this functionality, so clearly there is a need for it.

Right now Trickster's "Multi-Origin" feature is more analogous to something like virtual web hosting in Apache: each proxy request maps one-to-one to an origin request, and information from the client tells the proxy which upstream origin to use. So that won't really work for HA needs.
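To make that concrete, a Multi-Origin setup today looks something like the sketch below, with the client selecting the origin by hostname or URL path. (This is a rough TOML sketch in the style of our example config; the origin names and URLs are made up, and exact key names vary by version.)

    [origins]

        # First Prometheus, addressed independently by the client
        [origins.prom-east]
        origin_type = 'prometheus'
        origin_url = 'http://prometheus-east:9090'

        # Second Prometheus, cached separately from the first
        [origins.prom-west]
        origin_type = 'prometheus'
        origin_url = 'http://prometheus-west:9090'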

Having said that, what we are hoping to incorporate to meet the ask is an HA feature, where Trickster would make identical queries in parallel to multiple origins and merge the data back into a single document (the first origin to respond would be authoritative for the dataset, and subsequent origin responses would fill in any gaps in the timeline as needed).

The Comcast team is looking into this capability and will use this issue you raised to track progress. Let us know if this would meet your needs or if we need to consider additional capabilities for this. Thank you for hitting us up!

roidelapluie commented 6 years ago

That can be achieved by putting Thanos in the middle: https://github.com/improbable-eng/thanos

jranson commented 6 years ago

A Trickster HA feature would be much simpler than Thanos, as we'd only be a cache of merged data rather than an origin source of truth like Thanos. Ideally we should aim to simplify architecture, especially when it comes to operational visibility, since every new piece of the puzzle becomes another point of failure. Running Proms -> Thanos -> Trickster increases the chances of failure and is probably overly complex for many users' needs. Having Trickster do the merge at the query level (a minor extension of the existing functionality) would eliminate the need for yet more infrastructure to accomplish the same goals.

jacksontj commented 6 years ago

One downside of putting Thanos in the middle is that it requires a specific setup of your Prometheus stack (sidecar, etc.). You could use promxy, which is solely an aggregating proxy for HA Prometheus setups. It doesn't solve the issue of "2 isn't as good as 1", but the two should work together without issue -- and we could work to integrate the two projects. From looking at the code here for Trickster, all the implementation is in the main package, which makes it basically impossible to integrate elsewhere. Would you guys be up for some code refactoring to potentially allow for reuse?

Nokius commented 4 years ago

Found out about your project today. In your talk at CloudNativeCon last year you mentioned HA availability, and this seems to be the ticket for that topic.

Is it planned to add the feature, or does anyone use Trickster with the previously mentioned and promising-sounding project Promxy?

Thanks for your work and for sharing it with others, I appreciate it!

SuperQ commented 4 years ago

We use Thanos Query proxy to handle our HA setup.

Nokius commented 4 years ago

@SuperQ Could I use the Thanos Query proxy without the other components of Thanos?

SuperQ commented 4 years ago

You need the sidecars and query only; you don't need the other components. You can also run the sidecars without any external storage, just as an aggregation proxy.
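Roughly, it looks like the sketch below (flags from the Thanos CLI; hostnames and ports are illustrative, and dedup assumes each Prometheus sets a distinct external "replica" label):

    # On each Prometheus node: a sidecar exposing the local TSDB over gRPC.
    # With no object-storage config, it acts purely as a query proxy.
    thanos sidecar \
      --prometheus.url=http://localhost:9090 \
      --grpc-address=0.0.0.0:10901

    # One central querier fanning out to both sidecars, deduplicating
    # replicas by their external "replica" label.
    thanos query \
      --http-address=0.0.0.0:19192 \
      --store=prometheus-a:10901 \
      --store=prometheus-b:10901 \
      --query.replica-label=replica

Trickster (or Grafana) then points at the querier's HTTP address as if it were a single Prometheus.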

Nokius commented 4 years ago
                            +-----------+
                            | trickster |
                            +-----+-----+
                                  |
                        +---------v-----------+
                        |                     |
                        | thanos-io - Querier |
                        |                     |
                        +---------+-----------+
                                  |
                  +---------------+----------------+
                  |                                |
+-----------------+--------------+  +--------------+------------------+
| prometheus | thanos-io sidecar |  | prometheus | thanos-io sidecar  |
+--------------------------------+  +---------------------------------+

Basically a setup like this is what you mean?

Thanks I will look into this 👍

jacksontj commented 4 years ago

Author of promxy here -- there are quite a few people using promxy + Trickster successfully in production. Promxy is simply an aggregating proxy of other Prometheus API endpoints, so it doesn't require additional sidecars, etc. In addition, it can actually aggregate other compatible services (such as VictoriaMetrics, since it's also a Prometheus API).
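For reference, a minimal promxy config for an HA pair is roughly the following (hostnames are illustrative; the example config in the repo is the authoritative reference):

    promxy:
      server_groups:
        # Both replicas scrape the same targets; promxy merges their
        # series and fills gaps from whichever node has the samples.
        - static_configs:
            - targets:
                - prometheus-a:9090
                - prometheus-b:9090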

Nokius commented 4 years ago

@jacksontj Thanks for joining the discussion, and for your comment. My first idea was to use promxy, as it's less complex than Thanos. I didn't find any documentation in your repo for setting up promxy; somehow /wiki points back to the repo itself.

Yeah, the VictoriaMetrics part sounds great too. I want to use that in a second step, after having HA Prometheus, for quicker querying.

jacksontj commented 4 years ago

@Nokius there isn't a lot of detailed documentation, largely because it's a very familiar configuration (although if you have some free time and want to contribute some better docs, I'm all for it :) ). The easiest way is to take a look at the example config (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml), which basically documents everything. If you run into any questions/comments/concerns, definitely feel free to create an issue on the repo.

And I'll point out (I think it's clear already, but just in case it was missed) -- promxy only does aggregation; it doesn't solve the long-term storage problem (which is something Thanos is working to solve).

Nokius commented 4 years ago

@jacksontj Thanks, I will look into it, hopefully next week or the one after; I have to check how busy I am. Sure, since I have to document my work anyway, I may be able to publish it.

Thanks guys for the great feedback!

mehyedes commented 4 years ago

We are currently using HAProxy between Trickster and 2 Prometheus nodes. It automatically switches to the backup node if the primary node goes down. It has been working fine for us for a while; a rough config sketch follows the diagram below.


                +---------------+                 
                |               |                 
                |    Grafana    |                 
                +---------------+                 
                        |                         
                 +--------------+                 
                 |  Trickster   |                 
                 |              |                 
                 +--------------+                 
                        |                         
                        |                         
                +---------------+                 
                |               |                 
                |    HA Proxy   |                 
                +---------------+                 
                        |                         
                        |                         
                        |                         
               +---------+------------+            
              |                      |            
              |                      |            
              |                      |            
    +-----------------+   +------------------+    
    |   Prometheus    |   |     Prometheus   |    
    |    (Active)     |   |      (Backup)    |    
    +-----------------+   +------------------+    
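The HAProxy piece of that diagram is roughly the following (addresses are illustrative, and the health check shown is just one option):

    frontend prometheus_in
        bind *:9090
        default_backend prometheus_ha

    backend prometheus_ha
        # Health-check Prometheus's own readiness endpoint.
        option httpchk GET /-/healthy
        server prom_active 10.0.0.1:9090 check
        # "backup" means this server only takes traffic if the active one fails.
        server prom_backup 10.0.0.2:9090 check backup
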
jacksontj commented 4 years ago

I have used a similar setup (haproxy + prometheus) before. Just as a heads up, the main problem you'll run into is that if a Prometheus node misses some scrapes (OOMs, restarts, updates, etc.), it'll have a "hole" in the metrics (and since they are separate nodes, they'll each have different holes that won't fill). This was one of the driving reasons to create promxy (to get HA and fill the holes). So although this is a simpler setup, it does have that one drawback; it might still be sufficient for your use-case, I just want to make sure people understand the tradeoffs :)

mehyedes commented 4 years ago

@jacksontj Absolutely :) In our case, we can tolerate "holes" in our metrics, although that happens only rarely. But in other use cases, this might not be acceptable.

jranson commented 4 years ago

Given the Promxy option (data redundancy) and the HAProxy option (uptime only), does anyone object to closing this issue/question as sufficiently answered? Regarding the use of exportable packages, we are tracking this closely but separately with @jacksontj under #239.

jranson commented 3 years ago

All, Trickster 2.0 is now in Beta, and it offers a new High Availability / Federation option for Prometheus, as well as other application load balancer capabilities. We hope to support HA for other TSDB providers in addition to prom in a subsequent beta. You can read all about it here. We'd love for you to give it a try and report any problems or enhancement requests on this issue or by filing a new one.
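As a rough sketch only (paraphrased from the beta docs from memory, so treat the key and mechanism names below as assumptions and check the 2.0 documentation for exact spellings), the new load balancer config for a Prometheus HA pair looks something like:

    backends:
      prom1:
        provider: prometheus
        origin_url: http://prometheus-a:9090
      prom2:
        provider: prometheus
        origin_url: http://prometheus-b:9090
      prom-ha:
        provider: alb
        alb:
          # 'tsm' = time series merge: fan out, merge, and gap-fill
          mechanism: tsm
          pool: [ prom1, prom2 ]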

jranson commented 3 years ago

Closing as implemented.