lyft / flinkk8soperator

Kubernetes operator that provides control plane for managing Apache Flink applications
Apache License 2.0
563 stars, 159 forks

Exception occurred in REST handler: Job X not found #256

Open liad5h opened 2 years ago

liad5h commented 2 years ago

Hey,

I am using the operator image docker.io/lyft/flinkk8soperator:1355d206b5fb4efd6f6e4ccf24085a87a29443c5, running on AWS EKS version 1.21.

Sometimes the job manager floods its log with the message below. Once that starts, I am unable to redeploy the flinkapp without it reaching the "DeployFailed" state.

log: 2022-07-04 06:03:35,466 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job <HASH> not found

At the same time, the task manager has no logs at all (which makes sense).

In the operator logs I see the following warning for multiple Flink apps: {"json":{"app_name":"esp-process-666","ns":"int-streaming","phase":"Running"},"level":"warning","msg":"Failed to reconcile resource <NAMESPACE>/<APP NAME>: GetJobOverview call failed with status 404 Not Found and message ''","ts":"2022-07-04T06:08:35Z"}
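To see which apps are affected, a small script can scan the operator's JSON log lines for this warning. This is only a rough sketch: it assumes every line has the same shape as the one above (`json.app_name`, `json.ns`, `msg`), and the function name `apps_with_missing_jobs` is made up for illustration, not part of the operator.

```python
import json

# Substring taken from the operator warning quoted above.
MARKER = "GetJobOverview call failed with status 404"


def apps_with_missing_jobs(lines):
    """Return sorted, unique (namespace, app_name) pairs whose reconcile hit the 404."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if MARKER not in str(entry.get("msg", "")):
            continue
        ctx = entry.get("json") or {}
        hits.add((ctx.get("ns", "?"), ctx.get("app_name", "?")))
    return sorted(hits)


# The log line from this report, used as sample input.
sample = (
    '{"json":{"app_name":"esp-process-666","ns":"int-streaming","phase":"Running"},'
    '"level":"warning","msg":"Failed to reconcile resource <NAMESPACE>/<APP NAME>: '
    'GetJobOverview call failed with status 404 Not Found and message \'\'",'
    '"ts":"2022-07-04T06:08:35Z"}'
)

if __name__ == "__main__":
    for ns, app in apps_with_missing_jobs([sample]):
        print(f"{ns}/{app}")
```

In practice the operator pod's log lines would be passed in instead of `sample`, for example by iterating over `sys.stdin`.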

Is this a known issue? How do I recover from this without deleting and redeploying the Flink app?

L-LGL commented 1 year ago

I have a similar problem and wish I had an answer sooner

lydian commented 1 year ago

I also have a similar issue.

liad5h commented 1 year ago

I tried to fix this by enabling Kubernetes HA, but then I ran into other issues with checkpoints.

I ended up replacing this operator with https://github.com/apache/flink-kubernetes-operator. So far it works much better.