grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Query ingesters in different clusters for in-memory/WAL logs without forcing membership #13353

Open Bear-LB opened 5 months ago

Bear-LB commented 5 months ago

I've made a configuration for Loki that consists of 2 Loki clusters, one in each network zone, separated from each other and sharing the same storage platform. This architecture somewhat works, but there's a problem when using a single querier to retrieve in-memory/WAL logs from the separate Loki clusters. See the picture below. The Protected Network-Zone is allowed to send egress traffic to the Exposed Network-Zone. The Exposed Network-Zone is not allowed to send traffic into the Protected Network-Zone. My goal is to read live logs from the ingester in either network zone from the querier. The problem is that I can't read in-memory/WAL logs from the Exposed Network-Zone; I have to wait until ingester-B has filled its chunks and sent them to the cloud storage.

For querier-A to query ingester-B's in-memory/WAL logs I have to add ingester-B to querier-A's memberlist; I could not figure out any other way. However, the querier will then, unprompted, tell the other Loki components to add it to their ring, which makes the components in every cluster think they're all in the same cluster. That causes errors and makes the cluster unhealthy, since the components in the Exposed Network-Zone can't initiate new network traffic to components in the Protected Network-Zone.

I'm looking for a suggestion for an alternative configuration that accomplishes my goal with the same architecture. Otherwise, the solution I'd like would be a configuration option that lets the querier component read from additional ingesters without forcing membership or a ring.

image

Bear-LB commented 5 months ago

For some reason I got it kind of working with a hack... It still does not work perfectly. Usually Loki should use the same configuration across the whole cluster, but to solve my issue the querier must have a configuration file that differs from the writer's.

Ingester-A:

    ingester:
      lifecycler:
        availability_zone: protected-zone
        ring:
          excluded_zones: exposed-zone

Querier-A:

    ingester:
      lifecycler:
        availability_zone: protected-zone

Ingester-B in exposed-zone:

    ingester:
      lifecycler:
        availability_zone: exposed-zone
        ring:
          excluded_zones: protected-zone

This means I only tell the querier that it should not exclude zones... The querier actually seems to change behavior by reading the configuration in the ingester: section... I would have thought the only component to change behavior based on the ingester: config would be the ingester component...

There's still membership between the 2 clusters but no component in either cluster seems to get unhealthy even though they don't have a fully communicating ring because of the firewall.

The protected-zone cluster must have rejoin_interval: set. Otherwise it will not re-read join_members: and re-invite the exposed-zone cluster when exposed-zone components get scaled or restarted, since the exposed-zone cluster can't advertise itself to its counterpart cluster because of the firewall.
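A minimal sketch of what that memberlist section might look like on the protected-zone components. The hostname and the interval below are placeholder values for illustration, not taken from the actual setup; only the `join_members` and `rejoin_interval` keys come from the description above.

```yaml
# Protected-zone components only (sketch; values are assumptions).
memberlist:
  join_members:
    # Hypothetical address of the exposed-zone memberlist endpoint.
    - loki-memberlist.exposed-zone.example:7946
  # Periodically re-run the join against join_members so that restarted or
  # rescaled exposed-zone components get re-invited. Leaving this unset keeps
  # rejoining disabled.
  rejoin_interval: 1m
```

The idea is that the protected zone always initiates the connection, which is the only direction the firewall allows.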

So everything somewhat works but I can't seem to stop exposed-zone from trying to gossip to protected-zone.. Which generates a ton of annoying logs...

    level=warn ts=2024-07-02T17:09:31.708191396Z caller=tcp_transport.go:440 component="memberlist TCPTransport" msg="WriteTo failed" addr=172.29.1.251:7946 err="dial tcp 172.29.1.251:7946: i/o timeout"

I've tried these configurations with very high values or zero. Changing them did nothing to change the frequency or stop the logs from getting spewed.

  gossip_interval: 
  gossip_nodes: 
  pull_push_interval: 
  retransmit_factor: