tanzerm opened this issue 2 years ago
Also seeing this issue deploying the latest 2.4.2 image of Loki with the loki-simple-scalable Helm chart, 2 write nodes and 2 read nodes pointing to AWS. Nodes are communicating and flushing logs, but when pointing Grafana to the Loki read service as a datasource I am getting this same error:
Loki: Internal Server Error. 500. rpc error: code = Unimplemented desc = unknown service logproto.Querier
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
We were seeing this issue in our deployments and testing when attempting to use an AWS IAM Role attached to the nodes along with dynamodb/aws as the schema store. After migrating last week to feeding in AWS access credentials and using boltdb-shipper with AWS S3 as the schema_config.object_store, our searching/ingesting/everything is working as expected, with no errors in the write or read nodes.
keepalive
We had a similar issue where one of our clusters stopped responding to any query with this error.
The only solution we found was a helm uninstall followed by an install.
I couldn't really get any insight into what was going on, only that it was not the initial configuration, as we had more clusters using it without issues.
Sorry to hear you faced this. About the issue: let us know if it happens again; there are a few things (configurations, mem/cpu dumps, etc.) that we can use to debug what caused it. Since it looks to be a transient/uncommon error, my intuition tells me that it is caused by a write node being evaluated as a read node. Since a write node doesn't implement the querier interface, it would return that error.
A way of identifying such a thing is to access the /ruler/ring, /ring, and /distributor/ring pages and double-check that only the right nodes are registered/listed there: only read nodes should appear on the ruler ring page, and only write nodes should appear on the /ring and /distributor/ring pages.
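For example, something along these lines will list who is registered in each ring (the hostnames, the port, and the loki- name prefix are placeholders for your own services; the grep is just a rough filter over the HTML status pages):
# Ingester and distributor rings: only write nodes should be listed.
curl -s http://loki-write:3100/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u
curl -s http://loki-write:3100/distributor/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u
# Ruler ring: only read nodes should be listed.
curl -s http://loki-read:3100/ruler/ring | grep -o 'loki-[a-z0-9.-]*' | sort -u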
@DylanGuedes thanks for the advice, we'll be following those steps if we encounter the issue again and post our findings here.
I've seen this issue when I installed the loki-distributed Helm chart into an IPv6 EKS cluster without the patch needed to make it work correctly with IPv6.
I am getting the same issue: 2 instances on AWS, one a read node and one a write node, both using an IAM Role for the credentials.
Hello, we are running Loki on AWS EKS 1.24 with IPv6 only, and we use the official Loki Helm chart for deployment. We had a similar issue with querying the read nodes. Unfortunately, log-level debug didn't provide any clues. We managed to solve it by adding instance_addr to the configuration.
common:
replication_factor: 1
instance_addr: "[${MY_POD_IP}]"
ring:
kvstore:
store: memberlist
instance_addr: "[${MY_POD_IP}]"
memberlist:
bind_addr:
- ${MY_POD_IP}
join_members:
- {{ include "loki.memberlist" . }}
extraArgs:
- -config.expand-env=true
extraEnv:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
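Since this relies on -config.expand-env=true to substitute ${MY_POD_IP}, one way to confirm the expansion actually happened is to dump the effective config from a running pod, for example (service name and port are illustrative):
kubectl port-forward svc/loki-read 3100:3100 &
# should print the pod's IP in brackets, not the literal ${MY_POD_IP}
curl -s http://127.0.0.1:3100/config | grep instance_addr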
@mike-ainsel We're also running AWS EKS 1.24 with IPv6 only, using the loki-distributed Helm chart. I've been trying to get Loki to function with the help of your suggestions above, but have not been successful. Do you mind sharing your full values.yaml for your Helm chart deployment?
Hello @hermes2000, I'm facing the same issue and we have a similar environment. Were you able to resolve it?
same problem ...
Hi, sorry for the late reply. I used these values with this helm chart: https://github.com/grafana/loki/tree/main/production/helm/loki
Helm values: values.yaml.zip
I have the same issue. I'm trying to move from monolithic mode to simple scalable deployment mode. It runs without problems with write + all targets, but write + read does not work, and even read + all does not work.
I checked the /ruler/ring page, the /ring page, and /distributor/ring - everything is right.
read + write should definitely work, so something is wrong/unexpected there. Just double checking: you are running different nodes/components with either -target=read or -target=write, but not -target=read,write on the same node, right?
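For reference, a minimal sketch of the two launch commands being discussed (the config path is a placeholder):
# read-path components only, on the read nodes
loki -config.file=/etc/loki/config.yaml -target=read
# write-path components only, on the write nodes
loki -config.file=/etc/loki/config.yaml -target=write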
I am using -target=read or -target=write. When I try to add a Loki read node to Grafana I get this error:
caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="rpc error: code = Code(500) desc = rpc error: code = Unimplemented desc = unknown service logproto.Querier\n"
Just in case: should I add all the write and read nodes to the memberlist, or should the memberlist differ depending on the role?
It should be fine to use the same memberlist for all components; that shouldn't be the culprit.
Maybe I'm missing something, but I can't find the reason. When switching from target=all to target=read, the node always throws this error. The debug log level does not add any useful information. If I add a node with target=query-frontend, the error appears on it too.
Can you share a docker-compose or something similar reproducing the problem?
I tried with these settings and Loki runs as a service. I used this config for both node types, but during tests I also removed unnecessary sections from the config depending on the target.
target: read
auth_enabled: false
memberlist:
abort_if_cluster_join_fails: false
# Expose this port on all distributor, ingester
# and querier replicas.
bind_port: 7946
# You can use a headless k8s service for all distributor,
# ingester and querier components.
join_members:
- node r1
- node r2
- node r3
- node w1
- node w2
- node w3
max_join_backoff: 1m
max_join_retries: 10
min_join_backoff: 1s
server:
http_listen_port: 3100
http_listen_address: 0.0.0.0
http_server_read_timeout: 1000s
http_server_write_timeout: 1000s
http_server_idle_timeout: 1000s
log_level: info
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 104857600
grpc_server_max_concurrent_streams: 1300
graceful_shutdown_timeout: 30s
distributor:
ring:
kvstore:
store: memberlist
ingester:
lifecycler:
address: 0.0.0.0
ring:
kvstore:
store: memberlist
final_sleep: 10s
max_transfer_retries: 2
concurrent_flushes: 64
wal:
enabled: false
compactor:
working_directory: /opt/loki/boltdb-shipper-compactor
shared_store: s3
retention_enabled: false
max_compaction_parallelism: 4
upload_parallelism: 10
limits_config:
enforce_metric_name: false
reject_old_samples: false
reject_old_samples_max_age: 768h
max_entries_limit_per_query: 20000
max_streams_per_user: 0
max_query_parallelism: 12
per_stream_rate_limit: 30MB
per_stream_rate_limit_burst: 50MB
ingestion_rate_mb: 30
ingestion_burst_size_mb: 40
split_queries_by_interval: 60m
max_chunks_per_query: 10000000
max_query_length: 0
deletion_mode: disabled
max_query_series: 10000
cardinality_limit: 500000
query_timeout: 20m
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0
storage_config:
aws:
s3: s3-config
s3forcepathstyle: true
boltdb_shipper:
active_index_directory: /opt/loki/boltdb-shipper-active
cache_location: /opt/loki/boltdb-shipper-cache
cache_ttl: 12h
shared_store: s3
tsdb_shipper:
active_index_directory: /opt/loki/tsdb-index
cache_location: /opt/loki/tsdb-cache
query_ready_num_days: 7
shared_store: s3
schema_config:
configs:
- from: 2022-06-01
store: boltdb-shipper
object_store: aws
schema: v12
index:
prefix: index_
period: 24h
- from: 2022-11-30
store: tsdb
object_store: aws
schema: v12
index:
prefix: tsdb_index_
period: 24h
querier:
max_concurrent: 48
query_range:
align_queries_with_step: true
max_retries: 1
cache_results: false
query_scheduler:
max_outstanding_requests_per_tenant: 102400
frontend:
log_queries_longer_than: 30s
compress_responses: true
max_outstanding_per_tenant: 102400
frontend_worker:
grpc_client_config:
max_send_msg_size: 104857600
max_recv_msg_size: 104857600
ingester_client:
remote_timeout: 120s
pool_config:
health_check_ingesters: true
remote_timeout: 30s
ruler:
rule_path: /tmp/rules
enable_api: true
enable_sharding: true
ring:
kvstore:
store: memberlist
wal:
dir: /opt/loki/ruler-wal
wal_cleaner:
period: 24h
storage:
type: local
local:
directory: /opt/loki/rules/
remote_write:
enabled: true
clients:
p1:
url: ":9090/api/v1/write"
p2:
url: ":9090/api/v1/write"
Could you try https://github.com/dylanGuedes/ssd-playground? It runs 3 read nodes + 3 write nodes with memberlist and works pretty well.
Wow. I changed the memberlist so it now contains only write nodes:
memberlist:
join_members:
- loki-write1
- loki-write2
- loki-write3
and added
common:
ring:
kvstore:
store: memberlist
After that it works as expected.
Could you try one more time, having one loki-read member in the join_members? I'll be surprised if that's what was wrong with your previous configuration.
It works too. I started testing with only one read member.
Aha, that's nice. So my hypothesis is that you had some rings using the default store (which is inmemory or consul, I can't remember) instead of all of them using memberlist.
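One way to test that hypothesis is to dump the effective configuration over HTTP and look at every ring's kvstore, for example (hostname and port are placeholders for one of your nodes):
# every kvstore block in the output should say store: memberlist;
# anything still on the default store points at a ring that never joined the gossip cluster
curl -s http://loki-read1:3100/config | grep -A1 'kvstore:'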
Thank you.
I'm getting the same error. I'm unable to query the querier when target=read. I get the error above. It works when I set target=all. Should all read and write nodes be configured in memberlist or just write nodes? The documentation is a little lacking in this area. Also, how do I get the querier to query the ingesters? Does the querier learn about the ingesters over gossip? I'm unable to query logs until they are written to minio. I need to be able to query current logs.
I'm getting the same error. I'm unable to query the querier when target=read. I get the error above. It works when I set target=all. Should all read and write nodes be configured in memberlist or just write nodes?
all of them
The documentation is a little lacking in this area.
sorry to hear that. Do you mind opening an issue so we can tackle it?
Also, how do I get the querier to query the ingesters? Does the querier learn about the ingesters over gossip?
Queriers will find ingester addresses through memberlist. Access /ring and /memberlist on your read and write nodes; all existing ingesters and clients should be listed there.
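For example (service name and port are placeholders):
# the gossip cluster, as seen from a read node: every read and write member should appear
curl -s http://loki-read:3100/memberlist
# the ingester ring, as seen from the same read node: every write node should be listed as ACTIVE
curl -s http://loki-read:3100/ring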
I'm unable to query logs until they are written to minio. I need to be able to query current logs.
Yeah, that's 100% related to misconfigured communication between ingesters and queriers. That said, to unblock you, feel free to poke around with https://github.com/dylanGuedes/ssd-playground. It will demo a setup with 3 read and 3 write nodes.
I was able to get this working by switching the network mode. We're running podman and originally had the slirp4netns network mode. slirp4netns uses a container network (10.0.2.0/24) with address translation and port translation. To get gossip working, I had to configure advertise_addr and advertise_port. I think the issue is that there is no way to advertise the gRPC address and port, as they were different from the host's. Once I switched the network mode to host (no address translation and no port translation), everything started working. I'm good now.
Facing the same problem with this config: https://github.com/grafana/loki/blob/v2.8.3/production/docker/docker-compose.yaml
Hi team, I am using grafana/loki:2.9.8 for an HA setup and facing the same issue. Here are my Loki config and docker-compose.yaml.
loki-config:
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
http_server_read_timeout: 10m
http_server_write_timeout: 10m
http_server_idle_timeout: 10m
memberlist:
join_members:
- loki:7946
ingester_client:
remote_timeout: 60s
ingester:
wal:
enabled: false
lifecycler:
address: 0.0.0.0
ring:
kvstore:
store: memberlist
replication_factor: 1
final_sleep: 0s
chunk_idle_period: 5m
chunk_target_size: 2048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 30s
max_transfer_retries: 0
schema_config:
configs:
- from: 2020-05-15
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
table_manager:
retention_deletes_enabled: true
retention_period: 48h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/index_cache
shared_store: s3
aws:
s3: s3://ap-south-1/*******
s3forcepathstyle: true
http_config:
response_header_timeout: 20s
compactor:
working_directory: /loki/boltdb-shipper-compactor
shared_store: s3
chunk_store_config:
max_look_back_period: 160h
limits_config:
#enforce_metric_name: false
ingestion_rate_mb: 250
ingestion_burst_size_mb: 270
split_queries_by_interval: 10m
max_query_parallelism: 50
per_stream_rate_limit: 64MB
per_stream_rate_limit_burst: 200MB
version: "3"
networks:
loki:
services:
read:
image: grafana/loki:2.9.8
command: "-config.file=/etc/loki/config.yaml -target=read"
ports:
- 3100
- 7946
- 9095
volumes:
- ./loki-config.yaml:/etc/loki/config.yaml
healthcheck:
test: [ "CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1" ]
interval: 10s
timeout: 5s
retries: 5
networks: &loki-dns
loki:
aliases:
- loki
write:
image: grafana/loki:2.9.8
command: "-config.file=/etc/loki/config.yaml -target=write"
ports:
- 3100
- 7946
- 9095
volumes:
- ./loki-config.yaml:/etc/loki/config.yaml
healthcheck:
test: [ "CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1" ]
interval: 10s
timeout: 5s
retries: 5
networks:
<<: *loki-dns
gateway:
image: nginx:latest
depends_on:
- read
- write
entrypoint:
- sh
- -euc
- |
cat <<EOF > /etc/nginx/nginx.conf
user nginx;
worker_processes 16; ## Default: 1
events {
worker_connections 65535;
}
http {
resolver 127.0.0.11;
server {
listen 3100;
client_max_body_size 0;
location = / {
return 200 'OK';
auth_basic off;
}
location = /api/prom/push {
proxy_pass http://write:3100\$$request_uri;
}
location = /api/prom/tail {
proxy_pass http://read:3100\$$request_uri;
proxy_set_header Upgrade \$$http_upgrade;
proxy_set_header Connection "upgrade";
}
location ~ /api/prom/.* {
proxy_pass http://read:3100\$$request_uri;
}
location = /loki/api/v1/push {
proxy_pass http://write:3100\$$request_uri;
}
location = /loki/api/v1/tail {
proxy_pass http://read:3100\$$request_uri;
proxy_set_header Upgrade \$$http_upgrade;
proxy_set_header Connection "upgrade";
}
location ~ /loki/api/.* {
proxy_pass http://read:3100\$$request_uri;
proxy_read_timeout 90;
proxy_connect_timeout 90;
proxy_send_timeout 90;
}
}
}
EOF
/docker-entrypoint.sh nginx -g "daemon off;"
ports:
- "3100:3100"
networks:
- loki
Please help me figure out where I am going wrong.
tl;dr: there is no logproto.Querier service on loki-read deployment(s), though it is available on loki-write.
grpcurl -plaintext \
-H 'x-scope-orgid: 1' \
-import-path proto \
-proto logproto.proto \
-d '{"selector":"{query=\"foo-bar\"}"}' \
loki-read.default.svc.cluster.local:9095 \
logproto.Querier/Query
ERROR:
Code: Unimplemented
Message: unknown service logproto.Querier
However, the service is published on loki-write:
grpcurl -plaintext \
-H 'x-scope-orgid: 1' \
-import-path proto \
-proto logproto.proto \
-d '{"selector":"{query=\"foo-bar\"}"}' \
loki-write.default.svc.cluster.local:9095 \
logproto.Querier/Query
{}
The reason for this, as we understand it, is that logproto.Querier is required when "tailing" logs, not when reading historical logs; so if you are running "tail" queries against Loki endpoints via gRPC, use the "write" endpoint(s), otherwise point to "read".
For a full reproduction, to set up grpcurl:
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_arm64.deb
dpkg -i grpcurl_1.9.1_linux_arm64.deb
git clone https://github.com/balena-io-modules/node-loki-grpc-client.git
pushd node-loki-grpc-client
grpcurl ...
Describe the bug
I want to run loki as a simple scalable deployment on two aws ec2 instances.
To Reproduce
Steps to reproduce the behavior:
On loki-read node:
Environment:
loki config
logs