tanzerm opened this issue 2 years ago
Also seeing this issue deploying the latest 2.4.2 image of Loki with the loki-simple-scalable Helm chart, 2 write nodes and 2 read nodes pointing to AWS. Nodes are communicating and flushing logs, but when pointing Grafana to the Loki read service as a datasource I am getting this same error:
Loki: Internal Server Error. 500. rpc error: code = Unimplemented desc = unknown service logproto.Querier
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
We were seeing this issue in our deployments and testing when attempting to use an AWS IAM Role attached to the nodes along with dynamodb/aws as the schema store. After migrating last week to feeding in AWS access credentials and using boltdb-shipper with AWS S3 as the schema_config.object_store, our searching/ingesting/everything is working as expected, with no errors in the write or read nodes.
keepalive
We had a similar issue where one of our clusters stopped responding to any query with this error.
The only solution we found was a helm uninstall followed by an install.
I couldn't really get any insight into what was going on, only that it was not the initial configuration, as we had more clusters using it without issues.
Sorry to hear you faced this. About the issue: let us know if it happens again; there are a few things (configurations, mem/cpu dumps, etc.) that we can use to debug what caused it. Since it looks to be a transient/uncommon error, my intuition tells me that it is caused by a write node being evaluated as a read node. Since a write node doesn't implement the querier interface, it would return that error.
A way of identifying such a thing is to access the /ruler/ring, /ring, and /distributor/ring pages and double-check that only the right nodes are registered/listed there: only read nodes should appear on the ruler ring page, and only write nodes should appear on the /ring and /distributor/ring pages.
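For example, something along these lines will list who is registered in each ring (the hostnames, the port, and the loki- name prefix are placeholders for your own services; the grep is just a rough filter over the HTML status pages):
# Ingester and distributor rings: only write nodes should be listed.
curl -s http://loki-write:3100/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u
curl -s http://loki-write:3100/distributor/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u
# Ruler ring: only read nodes should be listed.
curl -s http://loki-read:3100/ruler/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u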
@DylanGuedes thanks for the advice, we'll be following those steps if we encounter the issue again and post our findings here.
I've seen this issue when I installed the loki-distributed Helm chart into an IPv6 EKS cluster without the patch needed to make it work correctly with IPv6.
I am getting the same issue: 2 instances on AWS, one a read node and one a write node, both using an IAM Role for the credentials.
Hello, we are running Loki on AWS EKS 1.24 with IPv6 only, and we use the official Loki Helm chart for deployment. We had a similar issue with querying the read nodes. Unfortunately, log-level debug didn't provide any clues. We managed to solve it by adding instance_addr to the configuration.
common:
replication_factor: 1
instance_addr: "[${MY_POD_IP}]"
ring:
kvstore:
store: memberlist
instance_addr: "[${MY_POD_IP}]"
memberlist:
bind_addr:
- ${MY_POD_IP}
join_members:
- {{ include "loki.memberlist" . }}
extraArgs:
- -config.expand-env=true
extraEnv:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
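Since this relies on -config.expand-env=true to substitute ${MY_POD_IP}, one way to confirm the expansion actually happened is to dump the effective config from a running pod, for example (service name and port are illustrative):
kubectl port-forward svc/loki-read 3100:3100 &
# should print the pod's IP in brackets, not the literal ${MY_POD_IP}
curl -s http://127.0.0.1:3100/config | grep instance_addr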
@mike-ainsel We're also running AWS EKS 1.24 with IPv6 only, using the loki-distributed Helm chart. I've been trying to get Loki to function with the help of your suggestions above, but have not been successful. Do you mind sharing your full values.yaml for your Helm chart deployment?
Hello @hermes2000, I'm facing the same issue and we have a similar environment. Were you able to resolve it?
same problem ...
Hi, sorry for the late reply. I used these values with this helm chart: https://github.com/grafana/loki/tree/main/production/helm/loki
Helm values: values.yaml.zip
I have the same issue. I'm trying to move from monolithic mode to simple scalable deployment mode. It runs without problems with write + all targets, but write + read does not work, and even read + all does not work.
I checked the /ruler/ring page, the /ring page, and /distributor/ring - everything is right.
read + write should definitely work, so something is wrong/unexpected there. Just double checking: you are running different nodes/components with either -target=read or -target=write, but not -target=read,write on the same node, right?
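For reference, a minimal sketch of the two launch commands being discussed (the config path is a placeholder):
# read-path components only, on the read nodes
loki -config.file=/etc/loki/config.yaml -target=read
# write-path components only, on the write nodes
loki -config.file=/etc/loki/config.yaml -target=write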
I am using -target=read or -target=write. When I try to add a Loki read node to Grafana I get this error:
caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="rpc error: code = Code(500) desc = rpc error: code = Unimplemented desc = unknown service logproto.Querier\n"
Just in case: should I add all the write and read nodes to the memberlist, or should the memberlist differ depending on the role?
It should be fine to use the same memberlist for all components; that shouldn't be the culprit.
Maybe I'm missing something, but I can't find the reason. When switching from target=all to target=read, the node always throws this error. The debug log level does not add any useful information. If I add a node with target=query-frontend, the error appears on it too.
Can you share a docker-compose or something similar reproducing the problem?
I tried with these settings and Loki runs as a service. I used this config for both node types, but during tests I also removed unnecessary sections from the config depending on the target.
target: read
auth_enabled: false
memberlist:
abort_if_cluster_join_fails: false
# Expose this port on all distributor, ingester
# and querier replicas.
bind_port: 7946
# You can use a headless k8s service for all distributor,
# ingester and querier components.
join_members:
- node r1
- node r2
- node r3
- node w1
- node w2
- node w3
max_join_backoff: 1m
max_join_retries: 10
min_join_backoff: 1s
server:
http_listen_port: 3100
http_listen_address: 0.0.0.0
http_server_read_timeout: 1000s
http_server_write_timeout: 1000s
http_server_idle_timeout: 1000s
log_level: info
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 104857600
grpc_server_max_concurrent_streams: 1300
graceful_shutdown_timeout: 30s
distributor:
ring:
kvstore:
store: memberlist
ingester:
lifecycler:
address: 0.0.0.0
ring:
kvstore:
store: memberlist
final_sleep: 10s
max_transfer_retries: 2
concurrent_flushes: 64
wal:
enabled: false
compactor:
working_directory: /opt/loki/boltdb-shipper-compactor
shared_store: s3
retention_enabled: false
max_compaction_parallelism: 4
upload_parallelism: 10
limits_config:
enforce_metric_name: false
reject_old_samples: false
reject_old_samples_max_age: 768h
max_entries_limit_per_query: 20000
max_streams_per_user: 0
max_query_parallelism: 12
per_stream_rate_limit: 30MB
per_stream_rate_limit_burst: 50MB
ingestion_rate_mb: 30
ingestion_burst_size_mb: 40
split_queries_by_interval: 60m
max_chunks_per_query: 10000000
max_query_length: 0
deletion_mode: disabled
max_query_series: 10000
cardinality_limit: 500000
query_timeout: 20m
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0
storage_config:
aws:
s3: s3-config
s3forcepathstyle: true
boltdb_shipper:
active_index_directory: /opt/loki/boltdb-shipper-active
cache_location: /opt/loki/boltdb-shipper-cache
cache_ttl: 12h
shared_store: s3
tsdb_shipper:
active_index_directory: /opt/loki/tsdb-index
cache_location: /opt/loki/tsdb-cache
query_ready_num_days: 7
shared_store: s3
schema_config:
configs:
- from: 2022-06-01
store: boltdb-shipper
object_store: aws
schema: v12
index:
prefix: index_
period: 24h
- from: 2022-11-30
store: tsdb
object_store: aws
schema: v12
index:
prefix: tsdb_index_
period: 24h
querier:
max_concurrent: 48
query_range:
align_queries_with_step: true
max_retries: 1
cache_results: false
query_scheduler:
max_outstanding_requests_per_tenant: 102400
frontend:
log_queries_longer_than: 30s
compress_responses: true
max_outstanding_per_tenant: 102400
frontend_worker:
grpc_client_config:
max_send_msg_size: 104857600
max_recv_msg_size: 104857600
ingester_client:
remote_timeout: 120s
pool_config:
health_check_ingesters: true
remote_timeout: 30s
ruler:
rule_path: /tmp/rules
enable_api: true
enable_sharding: true
ring:
kvstore:
store: memberlist
wal:
dir: /opt/loki/ruler-wal
wal_cleaner:
period: 24h
storage:
type: local
local:
directory: /opt/loki/rules/
remote_write:
enabled: true
clients:
p1:
url: ":9090/api/v1/write"
p2:
url: ":9090/api/v1/write"
Could you try https://github.com/dylanGuedes/ssd-playground? It runs 3 read nodes + 3 write nodes with memberlist and works pretty well.
Wow. I changed the memberlist so it now contains only write nodes:
memberlist:
join_members:
- loki-write1
- loki-write2
- loki-write3
and added
common:
ring:
kvstore:
store: memberlist
After that it works as expected.
Could you try one more time, having one loki-read member in the join_members? I'll be surprised if that's what was wrong with your previous configuration.
It works too. I started testing with only one read member.
Aha, that's nice. So my hypothesis is that you had some rings using the default store (which is inmemory or consul, I can't remember) instead of all of them using memberlist.
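One way to test that hypothesis is to dump the effective configuration over HTTP and look at every ring's kvstore, for example (hostname and port are placeholders for one of your nodes):
# every kvstore block in the output should say store: memberlist;
# anything still on the default store points at a ring that never joined the gossip cluster
curl -s http://loki-read1:3100/config | grep -A1 'kvstore:'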
Thank you.
I'm getting the same error. I'm unable to query the querier when target=read. I get the error above. It works when I set target=all. Should all read and write nodes be configured in memberlist or just write nodes? The documentation is a little lacking in this area. Also, how do I get the querier to query the ingesters? Does the querier learn about the ingesters over gossip? I'm unable to query logs until they are written to minio. I need to be able to query current logs.
I'm getting the same error. I'm unable to query the querier when target=read. I get the error above. It works when I set target=all. Should all read and write nodes be configured in memberlist or just write nodes?
all of them
The documentation is a little lacking in this area.
sorry to hear that. Do you mind opening an issue so we can tackle it?
Also, how do I get the querier to query the ingesters? Does the querier learn about the ingesters over gossip?
Queriers will find ingester addresses through memberlist. Access /ring and /memberlist on your read and write nodes; all existing ingesters and clients should be listed there.
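For example (service name and port are placeholders):
# the gossip cluster, as seen from a read node: every read and write member should appear
curl -s http://loki-read:3100/memberlist
# the ingester ring, as seen from the same read node: every write node should be listed as ACTIVE
curl -s http://loki-read:3100/ring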
I'm unable to query logs until they are written to minio. I need to be able to query current logs.
Yeah, that's 100% related to misconfigured communication between ingesters and queriers. That said, to unblock you, feel free to poke around with https://github.com/dylanGuedes/ssd-playground. It will demo a setup with 3 read and 3 write nodes.
I was able to get this working by switching the network mode. We're running podman and originally had the slirp4netns network mode. slirp4netns uses a container network (10.0.2.0/24) with address translation and port translation. To get gossip working, I had to configure advertise_addr and advertise_port. I think the issue is that there is no way to advertise the gRPC address and port, as they were different from the host's. Once I switched the network mode to host (no address translation and no port translation), everything started working. I'm good now.
Facing the same problem with this config: https://github.com/grafana/loki/blob/v2.8.3/production/docker/docker-compose.yaml
Hi team, I am using grafana/loki:2.9.8 for an HA setup and facing the same issue. Here are my Loki config and docker-compose.yaml.
loki-config:
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
http_server_read_timeout: 10m
http_server_write_timeout: 10m
http_server_idle_timeout: 10m
memberlist:
join_members:
- loki:7946
ingester_client:
remote_timeout: 60s
ingester:
wal:
enabled: false
lifecycler:
address: 0.0.0.0
ring:
kvstore:
store: memberlist
replication_factor: 1
final_sleep: 0s
chunk_idle_period: 5m
chunk_target_size: 2048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 30s
max_transfer_retries: 0
schema_config:
configs:
- from: 2020-05-15
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
table_manager:
retention_deletes_enabled: true
retention_period: 48h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/index_cache
shared_store: s3
aws:
s3: s3://ap-south-1/*******
s3forcepathstyle: true
http_config:
response_header_timeout: 20s
compactor:
working_directory: /loki/boltdb-shipper-compactor
shared_store: s3
chunk_store_config:
max_look_back_period: 160h
limits_config:
#enforce_metric_name: false
ingestion_rate_mb: 250
ingestion_burst_size_mb: 270
split_queries_by_interval: 10m
max_query_parallelism: 50
per_stream_rate_limit: 64MB
per_stream_rate_limit_burst: 200MB
version: "3"
networks:
loki:
services:
read:
image: grafana/loki:2.9.8
command: "-config.file=/etc/loki/config.yaml -target=read"
ports:
- 3100
- 7946
- 9095
volumes:
- ./loki-config.yaml:/etc/loki/config.yaml
healthcheck:
test: [ "CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1" ]
interval: 10s
timeout: 5s
retries: 5
networks: &loki-dns
loki:
aliases:
- loki
write:
image: grafana/loki:2.9.8
command: "-config.file=/etc/loki/config.yaml -target=write"
ports:
- 3100
- 7946
- 9095
volumes:
- ./loki-config.yaml:/etc/loki/config.yaml
healthcheck:
test: [ "CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1" ]
interval: 10s
timeout: 5s
retries: 5
networks:
<<: *loki-dns
gateway:
image: nginx:latest
depends_on:
- read
- write
entrypoint:
- sh
- -euc
- |
cat <<EOF > /etc/nginx/nginx.conf
user nginx;
worker_processes 16; ## Default: 1
events {
worker_connections 65535;
}
http {
resolver 127.0.0.11;
server {
listen 3100;
client_max_body_size 0;
location = / {
return 200 'OK';
auth_basic off;
}
location = /api/prom/push {
proxy_pass http://write:3100\$$request_uri;
}
location = /api/prom/tail {
proxy_pass http://read:3100\$$request_uri;
proxy_set_header Upgrade \$$http_upgrade;
proxy_set_header Connection "upgrade";
}
location ~ /api/prom/.* {
proxy_pass http://read:3100\$$request_uri;
}
location = /loki/api/v1/push {
proxy_pass http://write:3100\$$request_uri;
}
location = /loki/api/v1/tail {
proxy_pass http://read:3100\$$request_uri;
proxy_set_header Upgrade \$$http_upgrade;
proxy_set_header Connection "upgrade";
}
location ~ /loki/api/.* {
proxy_pass http://read:3100\$$request_uri;
proxy_read_timeout 90;
proxy_connect_timeout 90;
proxy_send_timeout 90;
}
}
}
EOF
/docker-entrypoint.sh nginx -g "daemon off;"
ports:
- "3100:3100"
networks:
- loki
Please help me figure out where I am going wrong.
tl;dr: there is no logproto.Querier service on loki-read deployment(s), though it is available on loki-write.
grpcurl -plaintext \
-H 'x-scope-orgid: 1' \
-import-path proto \
-proto logproto.proto \
-d '{"selector":"{query=\"foo-bar\"}"}' \
loki-read.default.svc.cluster.local:9095 \
logproto.Querier/Query
ERROR:
Code: Unimplemented
Message: unknown service logproto.Querier
However, the service is published on loki-write:
grpcurl -plaintext \
-H 'x-scope-orgid: 1' \
-import-path proto \
-proto logproto.proto \
-d '{"selector":"{query=\"foo-bar\"}"}' \
loki-write.default.svc.cluster.local:9095 \
logproto.Querier/Query
{}
The reason for this, as we understand it, is that logproto.Querier is required when "tailing" logs, not when reading historical logs; so if you are running "tail" queries against Loki endpoints via gRPC, use the "write" endpoint(s), otherwise point to "read".
For a full reproduction, to set up grpcurl:
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_arm64.deb
dpkg -i grpcurl_1.9.1_linux_arm64.deb
git clone https://github.com/balena-io-modules/node-loki-grpc-client.git
pushd node-loki-grpc-client
grpcurl ...
Describe the bug
I want to run loki as a simple scalable deployment on two aws ec2 instances.
To Reproduce
Steps to reproduce the behavior:
On loki-read node:
Environment:
loki config
logs