Robsta86 closed this issue 1 year ago.
Weird timing, because I just ran into this same issue today and would very much like to hear an answer. Is the only solution for EKS to get the EBS CSI driver set up?
Not sure if this will resolve the whole problem, but in the schema config the object_store should be s3 instead of aws, I think.
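As a concrete snippet, the suggestion above would look something like this in the schema config (the date, schema version, and index prefix here are illustrative, not taken from this thread):

schema_config:
  configs:
    - from: 2023-01-01        # illustrative date
      store: boltdb-shipper
      object_store: s3        # "s3" rather than "aws"
      schema: v12
      index:
        prefix: index_
        period: 24h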
Unfortunately this won't resolve the problem; everything is still pending, waiting for a persistent volume that we cannot provision :)
Loki's primary storage is object storage; however, it does use a disk for a few things, most notably the ingester write-ahead log and the index shipper's working directory.
For the highest guarantees around not losing any data, you would need a persistent volume for these, at a minimum on the ingester (or write) components.
For other components like the compactor and index-gateway, an ephemeral disk can be enough.
If you use ephemeral disk for the components on the write path, your only durability is provided by the replication factor. So with a replication factor of 3, losing more than 1 disk would result in some loss of recent data.
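For reference, a minimal sketch of what that could look like in the simple scalable Helm values, assuming the chart's write/read targets expose persistence.size and persistence.storageClass and that an EBS-backed StorageClass exists; the sizes, class name, and replica counts are illustrative, not taken from this thread:

write:
  replicas: 3
  persistence:
    size: 10Gi          # illustrative; holds the WAL and index-shipper working directory
    storageClass: gp3   # assumes an EBS CSI StorageClass named "gp3" exists
read:
  replicas: 3           # can run on ephemeral disk; its local index/cache is re-fetched from object storage

The write path gets the persistent volume because that is where recent, not-yet-flushed chunks live.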
Events:
  Type     Reason            Age                    From               Message
  Warning  FailedScheduling  4m46s (x105 over 17h)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

Any solution for this?
We are in the same situation as described above. Is persistent storage needed for running Loki, or can you have a fully working setup of Loki in simple scalable mode using only S3?
You can; it took me way too long to get it going.
Granted, I have not fully stress tested it, but I confirmed that after a day a reboot is still able to see the previous day's logs and the S3 bucket is populated, so it's good enough for my POC for now.
Keep in mind the tmp dir is used, so you lose roughly the last ~2 hours of data on reboot, but this is adjustable via the compactor I think (more on that after the full values below).
Hopefully it helps. I think the important chunks are:
loki:
  ## handles the s3 magic
  storage:
    type: s3
    bucketNames:
      chunks: ${bucket}
      ruler: ${bucket}
    s3:
      region: ${region}
      bucketnames: ${bucket}
  commonConfig:
    # gotta set to tmp dir as no PV
    path_prefix: /tmp/loki
    replication_factor: 1
  schemaConfig:
    configs:
      - from: 2021-05-12
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: ${region}
      bucketnames: ${bucket}
singleBinary:
  replicas: 1
  persistence:
    enabled: false
Below are my full Helm values:
serviceAccount:
  create: false
  name: ${service_account_name}
memberlist:
  service:
    publishNotReadyAddresses: true
loki:
  storage:
    type: s3
    bucketNames:
      chunks: ${bucket}
      ruler: ${bucket}
    s3:
      region: ${region}
      bucketnames: ${bucket}
  commonConfig:
    # gotta set to tmp dir
    path_prefix: /tmp/loki
    replication_factor: 1
  auth_enabled: false
  limits_config:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
  compactor:
    apply_retention_interval: 1h
    compaction_interval: 5m
    retention_delete_worker_count: 500
    retention_enabled: true
    shared_store: s3
  schemaConfig:
    configs:
      - from: 2021-05-12
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: ${region}
      bucketnames: ${bucket}
  server:
    http_listen_port: 3100
    grpc_server_max_recv_msg_size: 104857600 # 100 Mb
    grpc_server_max_send_msg_size: 104857600 # 100 Mb
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
  ingester_client:
    grpc_client_config:
      max_recv_msg_size: 104857600 # 100 Mb
      max_send_msg_size: 104857600 # 100 Mb
  service:
    port: 80
    targetPort: 3100
    url: http://${domain}:{{ .Values.loki.service.port }}
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
## makes it monolith matching the old stack way
singleBinary:
  replicas: 1
  persistence:
    enabled: false
monitoring:
  selfMonitoring:
    enabled: true
  lokiCanary:
    enabled: false
gateway:
  enabled: false
test:
  enabled: false
ingress:
  enabled: true
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internal
    # Use this annotation (which must match a service name) to route traffic to HTTP2 backends.
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80}]'
    external-dns.alpha.kubernetes.io/hostname: ${domain}
    alb.ingress.kubernetes.io/shield-advanced-protection: "true"
    external-dns.alpha.kubernetes.io/ingress-hostname-source: annotation-only
    kubernetes.io/ingress.class: alb
  hosts:
    - ${domain}
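A side note on the "~2 hours of data" caveat above: that window most likely corresponds to the ingester's max_chunk_age, which defaults to 2h. Here is a hedged sketch of tightening it, assuming the chart merges a loki.ingester block into the rendered config the same way it does the other loki.* blocks in these values; flushing more often reduces the recent-data exposure at the cost of more, smaller objects in S3:

loki:
  ingester:
    max_chunk_age: 30m      # default 2h; maximum time a chunk stays in memory/on local disk before flushing
    chunk_idle_period: 15m  # default 30m; flush chunks of idle streams sooner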
@phyzical Hey! By the looks of things I'm experiencing very similar issues to the ones you ran into previously. Here's the error I receive:
WebIdentityErr: failed to retrieve credentials caused by: SerializationError
I've got the IAM role/policy/trust relationship correctly configured, including the EKS OIDC parameters.
Here are my Helm values - can you see anything that could do with tweaking, please?
serviceAccount:
  create: true
  name: loki-sa
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::**********:role/loki-bucket-role"
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      endpoint: https://eu-west-2.s3.amazonaws.com
      region: eu-west-2a
      s3ForcePathStyle: false
      insecure: false
    bucketNames:
      chunks: loki-logs
      ruler: loki-logs
      admin: loki-logs
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: clusterissuer
    hosts:
      - loki.test.com
  schema_config:
    configs:
      - from: 2023-04-13
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: eu-west-2
      bucketnames: loki-logs
      insecure: false
Do we have an update on this issue?
We are still getting errors in Loki when we explicitly specify the endpoint. We need it to be set to the S3 FIPS endpoint in order to use it with GovCloud, so this is a blocker for us.
Loki Config: |
  bucketnames: loki-cluster-backend-1,loki-cluster-backend-2
  insecure: false
  region: us-west-2
  s3forcepathstyle: false
  endpoint: s3.us-west-2.amazonaws.com
{"caller":"table_manager.go:143","err":"WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version=\"1|\n00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 |.0\" encoding=\"UT|\n00000020 46 2d 38 22 3f 3e 0a 3c 45 72 72 6f 72 3e 3c 43 |F-8\"?>.
Hi,
I am trying to deploy the Grafana Loki Helm chart (version 5.0.0) on our EKS cluster, and I noticed the read and write pods are stuck in the "pending" state.
After doing some research I figured out that the pods are stuck in the pending state because they rely on persistent storage. Since we have no persistent storage configured in our EKS cluster (and we are not intending to configure any), the PVCs for these pods are also stuck in the pending state:
This is the values.yaml I used to deploy the chart:
I was under the impression that when S3 is configured as the storage backend there is no need for persistent storage within the cluster. Did I misconfigure something? Is there a bug in the Helm chart? Or... is this expected behavior?
If the latter is the case, it is actually a bit confusing, since the loki-distributed Helm chart (which is no longer recommended, according to a Grafana Loki Configuration webinar I just watched) has no need for persistent storage within Kubernetes, and the same goes for the loki-stack Helm chart.