vishal-kvn opened this issue 4 years ago
@vishal-kvn I use a k8s deployment to run the Faust workers. I have configured the Faust app to auto-discover the agents, and the workers run indefinitely. This setup works fine for me.
@afausti Thanks for the reply. I will try it out.
@afausti Setting autodiscover=True did not fix the above issue. Also, I noticed that you set replicaCount to 1 (https://github.com/lsst-sqre/charts/blob/master/charts/kafka-aggregator/values.yaml#L3) for your worker. Have you deployed with a replicaCount greater than 1? For my use case I have a replicaCount of 3, but I noticed that only 1 worker (pod) is consuming messages.
Please let me know if you came across this behavior.
A couple of questions:
How many partitions do you have on your topic? You need at least one partition per worker.
Have you run "kubectl describe" on the pod after it is killed to get the status/event information? That should tell you why K8S is killing the pod
Do you have a readinessProbe and/or livenessProbe configured?
Are you allocating enough memory for the pods? OOMKilled is a very common reason for pods to get killed
Kubernetes will tell you what it doesn't like; you just need to look hard for it.
Hope this helps
@bobh66 Thanks for the reply.
> How many partitions do you have on your topic? You need at least one partition per worker.

I have one topic that has 6 partitions.

> Have you run "kubectl describe" on the pod after it is killed to get the status/event information? That should tell you why K8S is killing the pod.

I will be looking into this and will share more info.

> Do you have a readinessProbe and/or livenessProbe configured?

Yes. The pods pass the livenessProbe check.

> Are you allocating enough memory for the pods? OOMKilled is a very common reason for pods to get killed.

I haven't seen an OOMKilled error in the logs, and I have provisioned sufficient memory for the deployment.

> Kubernetes will tell you what it doesn't like, you just need to look hard for it.

Ack! I will take a closer look at the logs to find the root cause.
@afausti I see you're using the memory storage for Tables. Do you think you'd need to use a StatefulSet instead of a Deployment if you switched to rocksdb?
@taybin have you tried implementing a StatefulSet for Faust when using Rocksdb?
@vishal-kvn My Faust app is also getting a SIGTERM (15), though I'm running via docker-compose, not k8s. I'm wondering if this ever went anywhere for you?
Checklist
- The issue exists against the master branch of Faust.

Steps to reproduce
I am trying to deploy a Faust agent to a production environment using 2 pods. The agent consumes from a topic that has 6 partitions. After the deploy, the agent runs until it receives a SIGTERM (15), at which point it shuts down and stops consuming messages.
I am wondering if there are any best practices around deploys using kubernetes.
Expected behavior
The agent gracefully handles the SIGTERM.
Actual behavior
The app shuts down and stops consuming messages.
Versions