Refinery chart upgrade is failing

richamishra006 commented 5 months ago

I have a Refinery chart running perfectly fine on version 2.0.0. Redis is installed separately, and its hostname is set in values.yaml under RedisPeerManagement. As soon as I try to upgrade the chart, even to 2.1.2 (I also tried 2.9.0 and 2.9.1), it fails with the error below:

2024/05/17 12:41:41 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
time="2024-05-17T12:41:41Z" level=info msg="using identifier from interface" identifier=10.1.1.236 interface=eth0
unable to load peers: dial tcp: lookup my-redis-master.redis.cluster.local on 10.96.0.10:53: no such host
time="2024-05-17T12:41:51Z" level=error msg="failed to register self with redis peer store" error="dial tcp: lookup my-redis-master.redis.cluster.local on 10.96.0.10:53: no such host"

Can someone please help me here? I'm not sure what I am missing.

TylerHelmuth commented 5 months ago

The refinery 2.0.0 helm chart (helm chart version refinery-2.0.0) had a bug where PeerManagement was local by default. Helm chart version refinery-2.1.0 fixed this via https://github.com/honeycombio/helm-charts/pull/267. Were you already setting the PeerManagement to Redis?

Can you share your values.yaml?

richamishra006 commented 5 months ago

Yes, I tried setting PeerManagement to Redis, but I still get that error while upgrading.

config:
  Collection:
    AvailableMemory: '2GB'
  PeerManagement:
    Type: redis
  RedisPeerManagement:
    Host: 'my-redis-master.redis.cluster.local:6379'
    Timeout: 15s

redis:
  enabled: false

TylerHelmuth commented 5 months ago

The only thing I can think of is that in Refinery 2.0.2 we increased the redis scan batch size: https://github.com/honeycombio/refinery/releases/tag/v2.0.2.

I see this Redis instance isn't coming from the helm chart but is expected to be inside the cluster. How are you installing it? Are the IP addresses in the error message correct? Is this problem happening on a helm install or a helm upgrade?

richamishra006 commented 5 months ago

We are facing this issue in an EKS cluster where Redis is provided by AWS ElastiCache. I replicated the same setup locally and get the same error there as well. Redis is in the same network, and with 2.0.0 we have no connectivity issues. In my local minikube I also installed Redis in the same cluster, and with 2.0.0 it works perfectly fine. The chart is running on 2.0.0, and as soon as I upgrade it with the helm upgrade command (I tried 2.1.2 and 2.9.0 as well), the error is the same.

I think you would get this error if you replicate the same setup.

TylerHelmuth commented 5 months ago

@richamishra006 I was able to reproduce the issue locally only when using an invalid redis peer host, such as 'refinery-redis.default.cluster.local:6379' instead of 'refinery-redis.default.svc.cluster.local:6379'. As long as I provided a valid host endpoint I was able to perform an upgrade with no issues. Definitely check that the endpoint you're providing is correct.
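
The two hostnames above differ only in the `svc` label: in-cluster Service DNS names normally take the form `<service>.<namespace>.svc.<cluster-domain>`. As a minimal sketch, a syntactic pre-flight check for that shape could look like the following. The function name `looks_like_service_fqdn` and the default cluster domain `cluster.local` are assumptions for illustration; the check does not resolve the name and does not apply to external endpoints such as a managed Redis.

```python
def looks_like_service_fqdn(host_port: str, cluster_domain: str = "cluster.local") -> bool:
    """Return True if host_port has the shape <service>.<namespace>.svc.<cluster_domain>[:port].

    Hypothetical helper: purely syntactic, no DNS lookup is performed.
    """
    # Drop an optional ":port" suffix.
    host = host_port.rsplit(":", 1)[0] if ":" in host_port else host_port

    suffix = "." + cluster_domain
    if not host.endswith(suffix):
        return False

    # What remains should be exactly: service, namespace, "svc".
    labels = host[: -len(suffix)].split(".")
    return len(labels) == 3 and labels[2] == "svc" and all(labels[:2])


if __name__ == "__main__":
    for candidate in (
        "refinery-redis.default.svc.cluster.local:6379",
        "refinery-redis.default.cluster.local:6379",
    ):
        print(candidate, looks_like_service_fqdn(candidate))
```

Run against the `RedisPeerManagement.Host` value before an upgrade, a check like this would flag the missing `svc` label, though only an actual DNS lookup from inside the cluster confirms that the name resolves.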

richamishra006 commented 5 months ago

Thanks for the quick response, @TylerHelmuth. I missed the svc in the Redis endpoint in my local setup; after adding it, I am getting this error:

$ kubectl logs my-refinery-6f66bfccf4-h2w2f
2024/05/18 03:22:00 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
time="2024-05-18T03:22:00Z" level=info msg="using identifier from interface" identifier=10.1.1.240 interface=eth0
time="2024-05-18T03:22:00Z" level=error msg="registration failed" err="NOAUTH Authentication required." name="http://10.1.1.240:8081" timeoutSec=10
time="2024-05-18T03:22:00Z" level=error msg="failed to register self with redis peer store" error="NOAUTH Authentication required."
unable to load peers: NOAUTH Authentication required.

However, I verified that the Redis ElastiCache endpoint is correct in my prod setup, which is why it works with 2.0.0. I am wondering how it was connecting to Redis with an incorrect endpoint in my local setup on version 2.0.0.
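
The two startup failures seen so far in this thread point at different causes: `no such host` means the Redis hostname did not resolve in DNS, while `NOAUTH Authentication required.` is the reply Redis gives when it requires authentication and the client has not authenticated. A small sketch that sorts Refinery log lines into those two buckets might look like this; `classify_redis_error` and the category strings are hypothetical names for illustration, not part of Refinery.

```python
def classify_redis_error(log_line: str) -> str:
    """Map a Refinery startup log line to a likely cause (illustrative only)."""
    if "no such host" in log_line:
        # The Go resolver could not find the hostname: check the endpoint spelling.
        return "dns"
    if "NOAUTH" in log_line:
        # Redis was reached but rejected the unauthenticated command.
        return "auth"
    return "unknown"


if __name__ == "__main__":
    lines = [
        "unable to load peers: dial tcp: lookup my-redis-master.redis.cluster.local on 10.96.0.10:53: no such host",
        "unable to load peers: NOAUTH Authentication required.",
    ]
    for line in lines:
        print(classify_redis_error(line), "<-", line)
```

In the `auth` case the hostname is already correct, so the remaining step is to supply the Redis credentials in Refinery's Redis peer configuration; the exact setting depends on the Refinery version, so consult the configuration reference for the version in use.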

richamishra006 commented 5 months ago

@TylerHelmuth I upgraded the Refinery helm chart to 2.9.0 and the Redis (AWS ElastiCache) connection issue is resolved. However, I am getting errors in the Refinery pod logs:

time="2024-06-05T10:55:22Z" level=error msg="error when sending event" api_host="http://100.64.173.97:8081/" dataset=processor-indexing environment=honeycomb-perf error="got unexpected HTTP status 503: Service Unavailable" roundtrip_usec=111279 status_code=503 trace.span_id=6655e67890000

I searched for the trace.span_id in the Honeycomb UI but could not find this span ID. Could you please help us figure out whether we are missing anything?

robbkidd commented 4 months ago

This issue is being handled as a support ticket for the operational environment.

honeycombio / helm-charts

Refinery chart upgrade is failing #364