grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki SSD fails if interface is not eth0 or en0 #4948

Open darrikmazey opened 2 years ago

darrikmazey commented 2 years ago

Describe the bug: A minimal config for SSD fails if the network interface is neither eth0 nor en0, causing services to bind to lo instead.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (v2.4.1) on an AWS EC2 instance with default interface ens5

Expected behavior: Services were expected to bind to the private IP address.

Environment:

Screenshots, Promtail config, or terminal output: On startup, the logs showed:

Dec 15 14:44:13 i-0c07138529555a68b loki[5836]: level=warn ts=2021-12-15T14:44:13.193460761Z caller=util.go:168 msg="error getting interface" inf=eth0 err="route ip+net: no such network interface"
Dec 15 14:44:13 i-0c07138529555a68b loki[5836]: level=warn ts=2021-12-15T14:44:13.194079795Z caller=util.go:168 msg="error getting interface" inf=en0 err="route ip+net: no such network interface"

/config showed:

  instance_interface_names:
  - eth0
  - en0
  - lo
  address: ""
  port: 0

This was rectified by adding the following to the config:

common:
  ring:
    interface_names:
      - ens5
dginhoux commented 2 years ago

Hi,

In a container, it's not possible to get the device name in order to specify it in the config file... Is it possible to use a wildcard, or to use all available interfaces?

DylanGuedes commented 2 years ago

That's a very complicated problem. Based on internal discussions we have had, there are scenarios where users would be badly affected by using an unwanted interface, and scenarios where users would be badly affected by not using a wanted interface, so picking a default that works for everyone is not easy.

That said, we are focusing instead on improving the configuration experience. For that, we are adding a way to configure the network interface used by Loki in a single place, inside the common configuration section.

For context, the main problem with the current setup is that when you configure common: ring: interface_names, you might fall into the trap of thinking Loki will use the defined interface_names everywhere. In reality it is only used for ring communication, not for other components (e.g. the frontend, which isn't a ring). With the new configuration this is solved, as what is defined there is used by all Loki components.
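
For illustration, a rough sketch of the difference, using the key names that appear later in this thread (treat the exact names as version-dependent and check the docs for your release):

# old: only applies to ring communication
common:
  ring:
    interface_names:
      - ens5

# new: applies to all Loki components
common:
  interface_names:
    - ens5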

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

dginhoux commented 2 years ago

Still present, and still searching for a workaround for fully containerized deployments.

DylanGuedes commented 2 years ago

I believe it is solved; you only have to use the common section directly instead of the common/ring section:

common:
-  ring:
-    interface_names:
-      - ens5
+  interface_names:
+    - ens5
dginhoux commented 2 years ago

Yes, this works in a bare-metal setup where it's easy to get the network device name. But in a fully containerized environment like Kubernetes or Swarm, with multiple networks, how would you do it?

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

dginhoux commented 2 years ago

I'll try in a few days to see if the 2.5 release helps when deploying in Swarm and Kubernetes containers.

vladmiller commented 3 months ago

Sooo... any solution for EKS on IPv6 for this issue? I've tried setting ::1 manually,

loki:
  commonConfig:
    ring:
      instance_addr: ::1

but then I'm getting

level=info ts=2024-06-04T11:19:46.231869827Z caller=main.go:120 msg="Starting Loki" version="(version=3.0.0, branch=HEAD, revision=b4f7181c7a)"
level=info ts=2024-06-04T11:19:46.232898195Z caller=server.go:354 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2024-06-04T11:19:46.233751682Z caller=modules.go:730 component=bloomstore msg="no metas cache configured"
level=info ts=2024-06-04T11:19:46.233866322Z caller=blockscache.go:420 component=bloomstore msg="run ttl evict job"
level=info ts=2024-06-04T11:19:46.233914608Z caller=blockscache.go:380 component=bloomstore msg="run lru evict job"
level=info ts=2024-06-04T11:19:46.233925012Z caller=blockscache.go:365 component=bloomstore msg="run metrics collect job"
level=info ts=2024-06-04T11:19:46.243431409Z caller=table_manager.go:273 index-store=tsdb-2024-04-01 msg="query readiness setup completed" duration=3.134µs distinct_users_len=0 distinct_users=
level=info ts=2024-06-04T11:19:46.243486727Z caller=shipper.go:160 index-store=tsdb-2024-04-01 msg="starting index shipper in RO mode"
level=info ts=2024-06-04T11:19:46.244504387Z caller=mapper.go:47 msg="cleaning up mapped rules directory" path=/var/loki/rules-temp
level=info ts=2024-06-04T11:19:46.24931573Z caller=module_service.go:82 msg=starting module=server
level=info ts=2024-06-04T11:19:46.249441996Z caller=module_service.go:82 msg=starting module=analytics
level=info ts=2024-06-04T11:19:46.249450742Z caller=module_service.go:82 msg=starting module=runtime-config
level=info ts=2024-06-04T11:19:46.249718437Z caller=module_service.go:82 msg=starting module=bloom-store
level=info ts=2024-06-04T11:19:46.249785472Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=info ts=2024-06-04T11:19:46.249803202Z caller=module_service.go:82 msg=starting module=index-gateway-ring
level=info ts=2024-06-04T11:19:46.249934334Z caller=module_service.go:82 msg=starting module=compactor
level=info ts=2024-06-04T11:19:46.250116788Z caller=module_service.go:82 msg=starting module=query-scheduler-ring
level=info ts=2024-06-04T11:19:46.250202472Z caller=module_service.go:82 msg=starting module=ring
level=error ts=2024-06-04T11:19:46.250916073Z caller=loki.go:519 msg="module failed" module=ring error="starting module ring: invalid service state: Failed, expected: Running, failure: unable to initialise ring state: Get \"http://localhost:8500/v1/kv/collectors/ring?stale=\": dial tcp [::1]:8500: connect: connection refused"
DylanGuedes commented 3 months ago

> Sooo... any solution for EKS on IPv6 for this issue? I've tried setting ::1 manually, but then I'm getting [...]

I remember someone having success on AWS by adding 127.0.0.1 to the instance_addr. Have you tried that? Regarding the error you're facing now:

level=error ts=2024-06-04T11:19:46.250916073Z caller=loki.go:519 msg="module failed" module=ring error="starting module ring: invalid service state: Failed, expected: Running, failure: unable to initialise ring state: Get \"http://localhost:8500/v1/kv/collectors/ring?stale=\": dial tcp [::1]:8500: connect: connection refused"

It is using port 8500, which seems wrong; memberlist runs on a different port by default. I suggest making sure your components are serving memberlist on the same port that the memberlist client is configured to use.
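
For illustration, a rough sketch combining those two suggestions, using the same commonConfig layout as the comment above (the exact keys are assumptions to verify against your Loki/Helm chart version):

loki:
  commonConfig:
    ring:
      # bind the ring to loopback instead of relying on interface auto-detection
      instance_addr: 127.0.0.1
      kvstore:
        # use memberlist for the ring instead of the Consul default implied by port 8500
        store: memberlist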

vladmiller commented 3 months ago

@DylanGuedes thanks for the feedback. I eventually realized that Loki can't get the IP from the interface in k8s, because (a) en0 and eth0 don't exist in k8s pods, and (b) even if I provide interfaces that do seem to exist in the pod, it still fails to acquire an IP address. So setting instance_addr to ::0 (IPv6 in my case, since the cluster is IPv6) works.

Now, port 8500 is Consul's default port, so it seems Loki was trying to use Consul for the ring configuration. I had to manually set kvstore.store to memberlist.

Eventually, I realized that for my use case I can just run SingleBinary mode and have 2 replicas. Here is the config that works for me.

deploymentMode: SingleBinary

loki:
  auth_enabled: false

  commonConfig:
    ring:
      instance_addr: "::0"
      kvstore:
        store: inmemory
    replication_factor: 1
    path_prefix: /var/loki

  server:
    http_listen_port: 3100
    grpc_listen_port: 9095

  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  # querier:
  #   # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
  #   max_concurrent: 4

  storage:
    type: 's3'
    bucketNames:
      chunks: loki-logs-.....
      ruler: loki-logs-.....
      admin: loki-logs-.....
    s3:
      region: us-east-1

gateway:
  enabled: true
  replicas: 1
  resources: 
    limits:
      memory: 96Mi

singleBinary:
  replicas: 2
  autoscaling:
    enabled: true
  persistence:
    enabled: true
    size: 4Gi
    storageClass: io2
  limits:
    memory: 256Mi
read:
  replicas: 0
backend:
  replicas: 0
write:
  replicas: 0

chunksCache:
  enabled: true

resultsCache:
  enabled: true

lokiCanary:
  enabled: false
  resources: 
    limits:
      memory: 32Mi

test:
  enabled: false

For all the future people struggling with the same issue:

As far as I understand, if you don't have many logs (I saw 100GB/day mentioned somewhere – unverified), you can use SingleBinary mode to make things easier.

Also, chunksCache and resultsCache are simple memcached services; however, they request a lot of RAM by default, so keep that in mind.
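
If you are RAM-constrained, a sketch of simply disabling both caches, using the same chunksCache/resultsCache keys as in the values above (at the cost of query performance):

# disable the memcached-based caches to save memory
chunksCache:
  enabled: false

resultsCache:
  enabled: false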

If you're using SingleBinary, you also need to set replication_factor to 1; otherwise Loki will complain that there are not enough replicas.