anomam opened 1 year ago
/area networking
Ask your question here:
Hi @dprotaso, as requested in our Slack discussion, I'm creating an issue here. Thank you for your help!
### Issue
We're using a KafkaSource + Knative Service to run long-running jobs (up to roughly 20 minutes of runtime per request), and sometimes requests get retried even though the service shows no sign that the original request failed. In fact, when a request is retried (often processed by a different service pod), the original eventually finishes as well, so our system ends up processing the same request twice. Our services are implemented with Python FastAPI, and the container entrypoint is wrapped with dumb-init to get proper termination behavior. As far as we can observe, the service pods are not being restarted (e.g., due to OOM kills).
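To give an idea of the shape of the service, here is a minimal sketch of the kind of handler involved (the endpoint, payload handling, and timing are illustrative, not our actual code):

```python
import asyncio

from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/")
async def handle_job(request: Request):
    # The KafkaSource delivers each record to the Service as an HTTP POST.
    payload = await request.body()

    # Stand-in for the real work: processing can take up to ~20 minutes.
    await asyncio.sleep(20 * 60)

    # A 2xx response signals that the record was processed successfully.
    return {"status": "done", "size": len(payload)}
```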
For now, our temporary workaround (which is not acceptable in production) is to pin the number of service pods. When no pods are being terminated, the issue does not occur and no requests are retried. This is causing us problems in production, and we can't figure out what is behind it.
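For context, a minimal sketch of how the pod count can be pinned on a Knative Service, using the standard autoscaling annotations (the name, image, and scale values here are illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: long-running-jobs            # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Pin the revision to a fixed number of pods so the autoscaler
        # never terminates them (temporary workaround only).
        autoscaling.knative.dev/min-scale: "3"
        autoscaling.knative.dev/max-scale: "3"
    spec:
      containers:
        - image: example.registry/long-running-worker:latest  # illustrative image
```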
We're using version 1.7.1 of Knative Serving and Eventing.
Are there any potential causes that we should investigate?
### Reproducing the issue
We've been able to reproduce the issue fairly reliably when the following conditions are met:
The log from the activator:
Cleaning up the error message and the stack trace from the log above: